Abstract

Gene regulation relies on the specificity of transcription factor (TF)-DNA interactions. Limited 
specificity may lead to crosstalk: a regulatory state in which a gene is either incorrectly activated 
due to noncognate TF-DNA interactions or remains erroneously inactive. Since each TF can have 
numerous interactions with noncognate cis-regulatory elements, crosstalk is inherently a global 
problem, yet has previously not been studied as such. We construct a theoretical framework to 
analyze the effects of global crosstalk on gene regulation. We find that crosstalk presents a significant
challenge for organisms with low-specificity TFs, such as metazoans. Crosstalk is not easily
mitigated by known regulatory schemes acting at equilibrium, including variants of cooperativity 
and combinatorial regulation. Our results suggest that crosstalk imposes a previously unexplored 
global constraint on the functioning and evolution of regulatory networks, which is qualitatively 
distinct from the known constraints that act at the level of individual gene regulatory elements. 




Introduction 


Life depends on the specificity of molecular recognition to ensure that essential reactions only occur 
between cognate substrates even when similar noncognate substrates are present, sometimes in 
large excess. A paradigmatic example is that of the aminoacyl tRNA synthetase [1], which uses
kinetic proofreading [2] to load appropriate amino acids onto matching tRNAs. This and other
examples (including DNA replication, ligand sensing [3], protein-protein interactions [4-9],
recognition events in the immune system [10, 11], and molecular self-assembly [12]) indicate
that biology places a large premium on the reduction of unintended "crosstalk", a generic term that 
encompasses all potentially disruptive processes due to reactions between noncognate substrates. 

Molecular recognition is also fundamental to transcriptional regulation, the primary mechanism
by which cells control gene expression. The specificity of this regulation ultimately originates
in the binding interactions between special regulatory proteins, called transcription factors (TFs),
and short regulatory sequences on the DNA, called binding sites. Although each type of TF
preferentially binds certain regulatory DNA sequences, a large body of evidence shows that this binding
specificity is limited, and that TFs bind other noncognate targets as well [13-17]. These
additional binding targets were previously discussed in the context of their effect on the TF
concentration [18, 19]. However, if these sites happen to also be regulatory elements of other genes,
noncognate binding not only depletes TF molecules, but could also actively interfere with gene
regulation. This suggests that the crosstalk problem is global: in a pool of TF molecules of different
chemical species co-expressed at any one time, each molecule has a small probability of erroneously
regulating some subset of all genes. As the regulatory system grows in complexity, the number of
potential noncognate interactions will grow faster than the number of cognate interactions. While
this makes the problem biologically relevant and theoretically interesting, existing work has mostly
considered a reduced setting, by computing binding probabilities for a single TF to cognate vs
noncognate sites [20-23]. Such a reduced description overlooks the effect of this TF on
the (mis)regulation of genes that are not its cognate regulatory targets. Motivated by this
observation, our primary goal here is to develop a new framework for crosstalk that captures its global
nature, by simultaneously treating multiple TFs and multiple regulatory binding sites. Moreover,
the focus of prior work has been on how to achieve reliable gene regulation by cognate TFs [24],
while the complementary question of how to prevent erroneous regulation by noncognate TFs has
remained largely unexplored (but see [25]). As a result, it remains unclear whether crosstalk places
strong constraints on the ability of cells to orchestrate their gene expression programs, and to what
extent different molecular mechanisms could relax any such constraints.

To address these questions quantitatively, we construct a model of crosstalk in transcriptional 
regulation that satisfies three key requirements for biophysical plausibility. First, the model should 
be global. Global models, where many targets are simultaneously regulated by different TFs, will 
properly capture the faster-than-linear growth in the number of possible noncognate interactions as 
the number of TFs increases, and the difficulty in ensuring that recognition sequences for all TFs
remain sufficiently distinct. Second, the model should explicitly account for differential activation of
genes depending on regulatory conditions. Consequently, and in contrast to previously studied
cases of molecular recognition [2], the distinction between "erroneous" and "correct" outcomes
of regulation will depend on the presence or absence of the regulatory signals. In particular, the
ability of the regulatory system to keep genes reliably inactive when appropriate, despite crosstalk
interference, will emerge as an important consideration. Third, textbook models of transcriptional
regulation assume that TF-DNA interactions happen in equilibrium [22, 26]. This assumption,
which is supported experimentally for prokaryotic regulation [27, 28] and which underlies the
majority of modeling and bioinformatic applications, puts strong constraints on models of crosstalk.
In this work, we explore its consequences in depth; we report on out-of-equilibrium schemes
elsewhere [29].

Using our biophysical model we identify the parameters that have a major influence on crosstalk 
severity. While some of these parameters, such as the free concentration of TFs, are difficult to
estimate, we show that there exists a lower bound to crosstalk with respect to these parameters. This
implies the existence of a "crosstalk floor," which cannot be overcome even if TF concentrations
were optimally adjusted by the cell, by various feedback mechanisms or otherwise, and
compensated for sequestration at noncognate sites.

Our model allows us to ask a number of fundamental questions: How does the severity of 
crosstalk depend on the number of (co-expressed) genes or the biophysical properties of TF-DNA 
interactions, such as binding site length and binding energy, for which we have reliable estimates? 
How do the regulatory strategies of prokaryotes compare to those of eukaryotes? Do complex 
regulatory schemes, such as combinatorial regulation by activators and repressors, or cooperative 
activation, lower crosstalk, as is often implied [24]?

Many biophysical constraints have been shown to shape the properties of genetic regulatory 
networks, e.g., programmability [20], response speed [30], noise in gene expression and dynamic
range of regulation [31, 32, 33, 34], robustness [35] and evolvability of the regulatory sequences [36, 37].
Most of these constraints, however, could be understood at the level of individual genetic regulatory
elements. Crosstalk, as analyzed here, is special: while it originates locally due to biophysical
limits to molecular recognition, its cumulative effect only emerges globally. At the level of a single
genetic regulatory element, crosstalk can always be avoided by increasing the concentration of cognate
TFs or introducing multiple binding sites in the promoter. It is only when we self-consistently
consider that these same cognate TFs act as noncognate TFs for other genes, or that new binding 
sites in the promoter drastically increase the number of noncognate binding configurations, that 
crosstalk constraints become clear. 


Results 

A thermodynamic model of global crosstalk 

We start by introducing a basic model of regulation, in which each gene will be regulated in the
simplest possible manner by a dedicated TF type, and the mechanism of regulation will be identical for
every gene. For this basic model, where the limits to crosstalk are analytically computable, we will 
outline the reasoning, sketch the derivation, and interpret the results in the main text. We will then 
relax our simplifying assumptions in a variety of ways, and extend the analysis to more elaborate 
regulatory schemes, such as different flavors of cooperative or combinatorial regulation. We will 
summarize the corresponding results later in this section and present detailed computations in the 
Supplementary Information. 

We consider a cell that contains M genes, which need to be transcriptionally regulated. In the 
basic model, each gene is associated with a single binding site of length L basepairs, and a unique 
kind of TF, which—in environments where the TF is expressed—preferably binds to that binding 
site to activate transcription. We assume that the genes are inactive, unless a TF binds to their 
binding site. We later relax this simplification to cases where each TF regulates several genes. Every 
TF can also bind other (noncognate) binding sites, albeit with lower probability, as schematized in 
Figs. 1A, B. These noncognate interactions will contribute to crosstalk in our model.

We employ a thermodynamic model of regulation [27, 28, 23], which postulates that the gene
expression level depends on the equilibrium occupancy of TFs at the regulatory sites on the DNA.
This model has been widely used to predict gene expression and has been experimentally validated
in various systems [39, 40, 41]. In this framework, the binding probability of a TF to any binding
site, cognate or noncognate, is determined by two factors: the effective concentration of TFs, and 
the binding energy. 

We assume that the binding energy depends only on the number of mismatches between a particular
binding site and the consensus sequence unique to the given TF. Each binding site can thus
exist in one of three possible states [38]: (i) bound by a cognate TF; (ii) bound by a noncognate
TF; or (iii) unbound. Binding of the cognate factor (i) is energetically the most favorable state
and is assigned the energy E = 0. The unbound state (iii) is usually energetically least favorable,
with energy E_a > 0. Between these two extremes there exist noncognate-bound configurations (ii)
with intermediate energies that depend only on the number of nucleotide mismatches d between
the consensus sequence of the TF and the sequence of a given binding site, i.e., E(d) = εd, where
ε is the energy per mismatch. This mismatch energy model provides a tractable approximation to
more detailed models [28], and has been extensively used in the literature [42, 20].

Figure 1: Crosstalk in gene regulation. (A) A TF preferentially binds to its cognate binding site,
but can also bind noncognate sites, potentially causing crosstalk (an erroneous activation or
repression of a gene). (B) In a global setting where many TFs regulate many genes, the number of
possible noncognate interactions grows quickly with the number of TFs; additionally, it may
become difficult to keep TF recognition sequences sufficiently distinct from each other. (C) Cells
respond to changing environments by attempting to activate subsets of their genes. In this example,
the total number of genes is M = 4, and different environments (here, 6 in total) call for activation
of different subsets with Q = 2 genes. To control the expression in every environment, TFs for Q
required genes are present, while the TFs for the remaining M − Q genes are absent. Because of
crosstalk, TFs can bind noncognate sites, generating a pattern of gene expression that can differ
from the one required.

Gene regulation gives cells the ability to differentially activate subsets of their genes in a manner 
appropriate to the environmental conditions, signals, cell type, or time. In our basic model, we 
imagine a cell that responds to different environments by activating different subsets of Q genes 
(out of a total of M genes), while keeping the remaining M − Q genes inactive (see Fig. 1C). As
regulation unfolds, the regulatory network thus switches between equilibrium states where any
choice of Q out of M genes could be activated; to make the problem tractable, we assume that all
these choices are equally probable. In a given environment, activating a particular subset of Q out
of M genes is achieved by expressing the corresponding Q TF types. The remaining M − Q TF
types, corresponding to the genes that should remain inactive, are absent in the cell.
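
The counting behind this setup can be illustrated with a short sketch. It assumes, as stated in the text, that every choice of Q active genes out of M is an equally likely environment; the function names are hypothetical and introduced only for illustration.

```python
from math import comb


def n_environments(M: int, Q: int) -> int:
    """Number of distinct ways to choose which Q of the M genes are active."""
    return comb(M, Q)


def fraction_of_environments_gene_is_active(M: int, Q: int) -> float:
    """Fraction of equally likely environments in which one given gene is active.

    A given gene belongs to comb(M - 1, Q - 1) of the comb(M, Q) subsets,
    which simplifies to Q / M.
    """
    return comb(M - 1, Q - 1) / comb(M, Q)


if __name__ == "__main__":
    # The M = 4, Q = 2 example described in the Fig. 1C caption.
    print(n_environments(4, 2))
    print(fraction_of_environments_gene_is_active(4, 2))
```

For the M = 4, Q = 2 example this reproduces the 6 environments mentioned in the figure caption, with each gene required to be active in half of them.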

How does the cell express the correct set of TFs for any particular environment, and at what
concentrations are these TFs expressed? The issue is made seemingly even more complicated by
the fact that the TF concentration reflects the total number of TF molecules in the cell, as well
as any possible effects due to nonspecific TF localization or sequestration on the DNA and
elsewhere. What we will show below is that even if the TF presence and concentrations
were perfectly adjusted to the environment, a residual level of crosstalk (representing a lower
bound or intrinsic limit) is inevitable. Since we are interested precisely in this limit, we will not
need to specify the mechanisms by which cells control their TF concentrations, which likely involve 
complex regulatory network dynamics with feedback loops; instead, we will mathematically look 
for the lowest achievable crosstalk and show that even in an optimal scenario crosstalk can present 
a serious regulatory problem. 

In our model, the crosstalk error can be separated into two contributions that can be computed 
using basic statistical mechanics: 

1. For a gene i that should be active and whose cognate TF is therefore present, error occurs if
its binding site is bound by a noncognate regulator (activation out of context due to crosstalk),
or if the binding site is mistakenly unbound (gene is inactive). This happens with probability

$$ x_1(i) = \frac{e^{-E_a} + \sum_{j \neq i} C_j e^{-\epsilon d_{ij}}}{C_i + e^{-E_a} + \sum_{j \neq i} C_j e^{-\epsilon d_{ij}}}, \qquad (1) $$

where C_j is the concentration of the jth TF, d_ij is the number of mismatches between the jth
TF consensus sequence and the binding site of gene i, and ε the energy per mismatch; all
energies are measured in units of k_BT. Here we consider activation by a noncognate TF as
crosstalk; reasons for this choice, as well as an alternative model where such cross-activation
is not considered an error state, are presented in SI Section 4.

2. For a gene i that should be inactive and whose cognate TF is therefore absent, crosstalk error
only happens if its binding site is bound by a noncognate regulator (erroneous activation)
rather than remaining unbound. This happens with probability

$$ x_2(i) = \frac{\sum_{j \neq i} C_j e^{-\epsilon d_{ij}}}{e^{-E_a} + \sum_{j \neq i} C_j e^{-\epsilon d_{ij}}}. \qquad (2) $$

We define the global crosstalk X as the expected fraction of erroneously regulated genes. In our
basic model where all genes are identically regulated and TFs for genes that need to be activated
are present at equal concentrations (i.e., C_j = C/Q, where C is the total concentration of all TFs and
Q is the number of distinct TF species present simultaneously), we show in the SI that the crosstalk
is

$$ X = \frac{Q}{M} x_1 + \frac{M-Q}{M} x_2. \qquad (3) $$

Global crosstalk X ranges between zero (no erroneous regulation) and one (every gene is
misregulated).
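
The two error probabilities and their combination into X can be sketched in a few lines of Python. This is a minimal illustration, assuming the three-state Boltzmann weights described in the text (cognate-bound weight C_i, noncognate-bound weight C_j e^{-ε d_ij}, unbound weight e^{-E_a}); the function names and the numerical values in the usage lines are hypothetical.

```python
import math


def noncognate_weight(concentrations, mismatches, eps):
    """Sum of C_j * exp(-eps * d_ij) over the noncognate TFs j that are present."""
    return sum(c * math.exp(-eps * d) for c, d in zip(concentrations, mismatches))


def x1(C_i, concentrations, mismatches, eps, E_a):
    """Error probability for a gene that should be active (cognate TF present).

    The site is in error if it is noncognate-bound or unbound.
    """
    unbound = math.exp(-E_a)
    noncog = noncognate_weight(concentrations, mismatches, eps)
    return (unbound + noncog) / (C_i + unbound + noncog)


def x2(concentrations, mismatches, eps, E_a):
    """Error probability for a gene that should be inactive (cognate TF absent).

    The site is in error only if it is noncognate-bound.
    """
    unbound = math.exp(-E_a)
    noncog = noncognate_weight(concentrations, mismatches, eps)
    return noncog / (unbound + noncog)


def global_crosstalk(x1_value, x2_value, Q, M):
    """Expected fraction of misregulated genes when all genes are equivalent."""
    return (Q / M) * x1_value + ((M - Q) / M) * x2_value


if __name__ == "__main__":
    # Hypothetical numbers: three noncognate TFs at 4, 5 and 6 mismatches.
    conc, d = [1e-6, 1e-6, 1e-6], [4, 5, 6]
    a = x1(1e-6, conc, d, eps=2.0, E_a=15.0)
    b = x2(conc, d, eps=2.0, E_a=15.0)
    print(a, b, global_crosstalk(a, b, Q=2, M=4))
```

With no noncognate TFs present, x2 vanishes and x1 reduces to the probability that the site is simply unbound, which is a quick sanity check on the weights.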
Figure 2: Binding site similarity S and number of genes M are basic determinants of crosstalk.
(A) Binding site similarity, S(ε, L), determines the likelihood that a TF will bind noncognate sites,
if recognition sequences are of length L and the energy per mismatch is ε. A schematic diagram
of sequence space packing by different TFs: sequences (dots) in a colored circle are likely to be
bound by the TF whose consensus is the circle's center star. Smaller L contracts the sequence space
and makes crosstalk (circle overlap) more likely (larger S); crosstalk is increased (larger S) also by
smaller ε, which expands the circle radius. (B) Typical values for the number of genes, M, and
binding site similarity, S(ε, L), across different taxa, estimated from genomic databases. For each
organism, we find a distribution of S over its reported TFs (dots = median of the distribution, black
bars = ±1 quartile range; see SI Section 5.4 for details).
The major determinant of crosstalk is the likelihood of TFs to bind noncognate sites, which is
determined by the similarity between cognate and noncognate sites. In the global setting, making
a particular site less similar to all the remaining sites can only happen at the cost of making the
remaining sites more similar amongst themselves. For a large number of sites we describe this
effect by introducing an average binding site similarity measure S_i between the binding site of gene i
and all others, defined as:

$$ \sum_{j \neq i} C_j e^{-\epsilon d_{ij}} = C S_i(\epsilon, L) \approx C \sum_d P(d) e^{-\epsilon d}, \qquad (4) $$

where P(d) is the distribution of mismatches between all pairs of binding sites in our model and
C is the total concentration of all TFs. In the following we assume full symmetry between the
genes, so that for every i, S_i = S. S depends solely on the binding sites, but it carries no functional
meaning in the absence of any TF, namely when C = 0. We emphasize that this quantity, S, is not
arbitrary, but rather emerges from our calculations in Eqs. (1, 2); a related measure of the likelihood
of olfactory or immune receptors to bind an arbitrary ligand from a large repertoire has been
previously introduced and measured [44]. S_i is proportional to the probability of the i-th TF to bind
any noncognate binding site. The highest level of similarity, S = 1, occurs if all sites are identical
(d = 0). Similarity is very low, S ≈ 0, if the sites are all significantly different from each other. The
shorter the binding site length L is and the weaker the binding energy ε, the larger S gets and the
less distinguishable the sites are (Fig. 2A); simultaneously, we expect the crosstalk to increase, an
intuition we will make precise in the following section.

Binding site similarity S(ε, L) of Eq. (4) could be directly measured, by experimentally probing
the average TF binding affinity to a large repertoire of known binding sites. Alternatively, S can
be estimated from bioinformatic data. In Fig. 2B we used databases of known TF binding sites
to extract organism-specific estimates for S. Under certain assumptions about how binding sites
are organized in sequence space, S can also be computed theoretically. If the binding sites were
random sequences of length L, one can derive a simple analytical expression for S (see SI):
S(ε, L) = (1/4 + (3/4) e^{-ε})^L. We also studied more realistic models for how TF binding sites are
organized, e.g., taking into account the possibility of TFs to bind reverse-complemented sites (SI
Section 5.2); an improved biophysical model for mismatch energy that saturates with the number of
mismatches (SI Section 5.3); and a model of binding sites that have evolved to be maximally distinct
(SI Section 5.1). All these variations ultimately only affect the value of S while leaving the crosstalk
formalism unchanged. We therefore carried out our main computations as a function of S directly. To
estimate typical crosstalk for values of S that are biophysically realistic, we assumed that binding
sites are distributed as randomly as possible in the sequence space while avoiding excessive
similarity (i.e., we used the results of Fig. S14 with d_min = 2).
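
As a worked illustration of the random-sequence case, the sketch below evaluates S in two ways: through the closed form (1/4 + (3/4) e^{-ε})^L, and by summing P(d) e^{-εd} with a binomial mismatch distribution. The binomial P(d) (each position mismatching independently with probability 3/4) is the assumption that defines "random sequences" here; the function names are hypothetical.

```python
import math


def similarity_closed_form(eps: float, L: int) -> float:
    """S for uniformly random sites: each position matches with probability 1/4."""
    return (0.25 + 0.75 * math.exp(-eps)) ** L


def similarity_from_mismatch_distribution(eps: float, L: int) -> float:
    """S as the sum over d of P(d) * exp(-eps * d), with P(d) binomial(L, 3/4)."""
    total = 0.0
    for d in range(L + 1):
        p_d = math.comb(L, d) * 0.75 ** d * 0.25 ** (L - d)
        total += p_d * math.exp(-eps * d)
    return total


if __name__ == "__main__":
    S = similarity_closed_form(eps=2.0, L=10)
    # Natural logarithm of S, for comparison with the log(S) values quoted in the text.
    print(math.log(S), similarity_from_mismatch_distribution(2.0, 10))
```

For L = 10 and ε = 2 the natural logarithm of S comes out close to −10.5, in line with the baseline value used in the figures (which additionally assumes d_min = 2).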

Basic crosstalk model exhibits three distinct regulatory regimes 

While we can reasonably estimate the major determinants of crosstalk in our model (the number
of genes typically coactivated, Q, the total number of genes, M, and the binding site similarity
S), it is harder to determine the appropriate value for the total concentration of available TFs, C.
This is not only due to the lack of quantitative data, but also because the relation between the
total copy number of TFs in a cell and the concentration of TFs that are available for binding may
be complicated [18]. We thus opted for an alternative approach: we look for a concentration C*
that minimizes the crosstalk error X. An optimal C* emerges as a trade-off between activating
the Q genes that should be active (for which a higher concentration is beneficial) and avoiding the
activation of the remaining M − Q genes (for which a lower concentration is beneficial). Such a
minimum, X* = X(C*), is a lower bound on crosstalk, which can be analytically computed in
the mean field approximation (SI Section 1), as well as validated numerically by simulation (SI
Section 2). This level of crosstalk cannot be decreased even if a cell could perfectly adjust its TF
concentrations to the environment and optimally compensate the concentrations for nonspecific
binding and sequestration.
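
The trade-off can be made concrete with a brute-force scan over C. The sketch below is an illustration under stated assumptions, not the procedure used in the SI: it takes the symmetric mean-field form of the error probabilities (cognate weight C/Q, summed noncognate weight CS, unbound weight e^{-E_a}) and a plain logarithmic grid; the E_a values are hypothetical.

```python
import math


def crosstalk(C, S, Q, M, E_a):
    """Mean-field global crosstalk X of the symmetric basic model."""
    unbound = math.exp(-E_a)     # weight of the unbound state
    cognate = C / Q              # weight of the cognate-bound state
    noncognate = C * S           # summed weight of noncognate-bound states
    x1 = (unbound + noncognate) / (cognate + unbound + noncognate)
    x2 = noncognate / (unbound + noncognate)
    return (Q / M) * x1 + ((M - Q) / M) * x2


def minimize_crosstalk(S, Q, M, E_a, n_grid=4001):
    """Scan C over a log-spaced grid from 1e-10 to 1e4; return (C_best, X_best)."""
    best_C, best_X = 0.0, crosstalk(0.0, S, Q, M, E_a)
    for k in range(n_grid):
        C = 10.0 ** (-10.0 + 14.0 * k / (n_grid - 1))
        X = crosstalk(C, S, Q, M, E_a)
        if X < best_X:
            best_C, best_X = C, X
    return best_C, best_X


if __name__ == "__main__":
    S, Q, M = math.exp(-10.5), 2500, 5000
    for E_a in (10.0, 15.0):     # hypothetical unbound-state energies, in k_B T
        print(E_a, minimize_crosstalk(S, Q, M, E_a))
```

Under these assumptions the minimum found for both E_a values is about 0.23, while the minimizing concentration is lower for the larger E_a.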
Figure 3: Basic model with one activator binding site per gene exhibits three distinct regulatory
regimes. (A) Each binding site can be in one of three possible states with different corresponding
energies: bound by a cognate factor (E = 0, green molecule), bound by a noncognate
factor with d mismatches (E = εd, here a blue molecule with d = 2), or unbound (E = E_a, pink
molecule). The table shows which of these states lead to transcription and which of these outcomes
is considered as crosstalk when the cognate TF is present and the gene is required to be active (left),
or if it is absent and the gene is required to be inactive (right). (B) Minimal crosstalk X*, shown
in color, as a function of the number of coactivated genes Q and binding site similarity, S. Three
different regulatory regimes are separated by black and white boundary lines (analytical expressions
in SI), identical between panels (B) and (C). Dotted lines refer to the "baseline parameters"
(Q = 2500, M = 5000, log(S) = −10.5, which represents L = 10, ε = 2 with d_min = 2) that we use in all
subsequent figures if not specified differently. (C) Optimal TF concentration, C*, that minimizes
the crosstalk, relative to the optimal concentration at baseline parameters. For high binding site
similarity (large S), the crosstalk is minimized at C* = 0 (white region, I: "no regulation regime").
For Q close to M and intermediate S, the crosstalk is minimized at C* → ∞ (black region, II:
"constitutive regime"). In a large, biologically plausible intermediate regime, crosstalk is minimized at a
finite nonzero TF concentration (color, III: "regulation regime").
First, we consider a fixed number of total genes, M = 5 000, and ask how crosstalk depends
on the number of co-activated genes, Q, and the binding site similarity, S, in our basic model,
summarized in Fig. 3A. The optimization yields three distinct regulatory regimes, illustrated in
Figs. 3B, C. For larger values of S where binding sites are very similar, regulation is so non-specific
that crosstalk is minimized by having no TF at all, i.e., at C* = 0 (region I). This regime, which
happens whenever S > 1/(M − Q), is dysfunctional and thus biologically implausible.
Interestingly, the resulting fundamental limit to S, or to how similar binding sites can ever get while still
permitting functional regulation, is set by M − Q, the typical number of genes that must remain
inactive in each environment. This highlights the strong constraint on the regulatory system of
keeping undesired gene activation levels low despite crosstalk interference.

As the organism tries to activate increasingly large subsets of genes in each environment and Q 
increases, the optimal concentration C* climbs until we reach a regime where C* formally diverges 
(region II), shown in Fig. 3C and Fig. S2. In this limit, however, a biologically plausible solution
would simply be to constitutively express the majority of the genes rather than using transcriptional
regulation to do so, thus avoiding any possible crosstalk interference. This strategy might be
applicable for organisms living in nearly constant environments, such as obligate parasites.

Finally, there is a broad region (region III) in the (S, Q) plane where crosstalk is minimized by a
finite positive value for the optimal TF concentration. In this regime, which we call the "regulation
regime" since it corresponds to the biological notion of regulation, crosstalk is given to a very good
approximation by

$$ X^* = \frac{Q}{M} \left( -S(M-Q) + 2\sqrt{S(M-Q)} \right). \qquad (5) $$

This simple expression for X* is one of our key results. It is independent of the energy gap between
cognate and unbound states, E_a; increasing this gap only lowers the optimal concentration, C*,
while leaving the crosstalk unchanged. The crosstalk depends both on the fraction of genes that
need to be activated, Q/M, as well as on the total number of genes that need to be inactive, M − Q,
in a typical environment. This dependence also suggests that it is costly to maintain genes that are
never expressed, arguing against unlimited accumulation of obsolete genes in organisms. Crosstalk
X* in the regulation regime is dominated by the second term of Eq. (5), and thus increases as ~ √S
and as Q√(M − Q)/M for sufficiently small S. At the boundary between regions I and III, where
regulation breaks down, S(M − Q) = 1, hence X* = Q/M and is independent of S throughout
region I, because all genes that need to be active are in a crosstalk state due to absence of TFs.
Alternatively, we can view Eq. (5) as a function of M, the total number of genes, at a fixed fraction
of genes typically activated, Q/M. In that case we can see that the average binding site similarity
S sets the limit to the maximum number of genes in the organism, M < 1/S, if the system is to
stay in region III where regulation is effective. This is confirmed in Fig. S1 by a detailed analysis of
crosstalk for an organism with M = 20 000 genes, a typical number for a metazoan.
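
The closed-form expression for X* and the regime boundaries can be evaluated directly. In the sketch below, the formula X* = (Q/M)(2√(S(M − Q)) − S(M − Q)) and the region I condition S(M − Q) ≥ 1 follow the text; the region II condition is an assumption, derived here from the symmetric mean-field error probabilities (cognate weight C/Q, noncognate weight CS, unbound weight e^{-E_a}) rather than copied from the SI. Function names are hypothetical, and Q ≥ 1 is required.

```python
import math


def minimal_crosstalk_regulation_regime(S, Q, M):
    """X* in region III: (Q / M) * (2 * sqrt(s) - s), with s = S * (M - Q)."""
    s = S * (M - Q)
    return (Q / M) * (2.0 * math.sqrt(s) - s)


def regime(S, Q, M):
    """Classify (S, Q, M) into region 'I', 'II' or 'III'.

    I   : S * (M - Q) >= 1, crosstalk is minimized at C* = 0.
    II  : crosstalk keeps decreasing as C grows (C* diverges); the condition
          used here is an assumption derived from the mean-field sketch.
    III : a finite positive C* minimizes crosstalk.
    """
    r = math.sqrt(S * (M - Q))
    if r >= 1.0:
        return "I"
    if r * (1.0 + 1.0 / (Q * S)) <= 1.0:
        return "II"
    return "III"


if __name__ == "__main__":
    S0, Q0, M0 = math.exp(-10.5), 2500, 5000   # baseline parameters from the text
    print(regime(S0, Q0, M0), minimal_crosstalk_regulation_regime(S0, Q0, M0))
```

At the baseline parameters this lands in region III with X* close to 0.23, and at the I/III boundary S(M − Q) = 1 the expression reduces to Q/M.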

A quick inspection of Fig. 3B shows that crosstalk in the basic model is surprisingly high for
an organism of M = 5 000 genes of which typically a half (Q = M/2) would be activated in each
environment, and with TF specificity typical of metazoans (log(S) = −10.5). At these "baseline"
parameters, the crosstalk limit is X* ≈ 0.23, implying that almost a quarter of the genes at any time
would be in an erroneous regulatory state. This suggests that global crosstalk is a serious constraint
and that more complex regulatory mechanisms have evolved, in part, to permit reliable regulation
despite noncognate TF binding. In what follows, we examine variants of the basic model to assess
the robustness of our theoretical conclusions and compare, quantitatively, the crosstalk limit for
different regulatory scenarios at our "baseline" parameter set. These results are detailed below as
well as in the SI, and are summarized in Table 1.

Crosstalk constraints exist also in variations of the basic model 

We first ask whether the existence of regimes where regulation in the basic model is ineffective
(regions I and II) could be an artefact of penalizing expression of unnecessary proteins equally to
the incorrect expression of the necessary proteins. To study this, we vary the relative contribution
of the two components of crosstalk error, x_1 and x_2 from Eqs. (1, 2), to the total crosstalk, X. In
Fig. S5 we show that all three regimes reported for the basic model exist generically, although their
boundaries may shift (see also Table 1 and SI Section 1.3).

Table 1: Comparison of crosstalk levels between the different variants of the model. Baseline
parameters are: Q = 2500, M = 5000, log(S) = −10.5, equivalent to an optimal packing model
where distinct binding sites differ from each other in at least 2 bp (d_min = 2), with binding
sites of L = 10 bp and binding energy ε = 2 k_BT per mismatch.

Model | Crosstalk (at baseline parameters) | Remarks
Basic model (activators-only) | 0.23 |
Basic model (repressors-only) | 0.23 |
Mixed model (activators + repressors) | 0.14 | 2000 genes expressed in 20% of the env., 3000 genes in 70%.
Genes of unequal importance | 0.31 | 10% of the genes are important and penalized 10× the "normal" rate. The resulting error per important gene decreases to 0.1, but for the other genes increases to 0.33.
Unequal weights for the two error types | 0.17 | b = 0.5, weight of erroneously-active genes is half that of genes that are erroneously inactive.
Each TF regulates exactly Θ = 10 genes | 0.08 | Also holds for P(Θ) ~ Poisson(Θ = 10).
Activators + global non-specific repressor | 0.23 | Cannot reduce crosstalk.
Activators + specific repressors (non-overlapping) | 0.2 |
Activators + specific repressors (overlapping) | 0.15 | Uses only ~ √M TF species.
Perfect AND-gate combinatorial regulation | 0.07 |
Generic cooperativity | 0.064 | E.g., dimerization, direct TF-TF contacts, TF/nucleosome competition, etc. 2 binding sites, each of length L = 10.
Cooperativity exclusive to cognate binding | 0.006 | Currently unknown molecular mechanisms. 2 binding sites, each of length L = 10.

Next, we ask how our results change if all genes do not contribute equally to the total crosstalk, 
X. We thus split genes into two groups: "important" genes contribute to the crosstalk error more 
than "normal" genes, but—in order to compute the lower bound on crosstalk—we allow TFs of the 
basic model to redistribute optimally between the two groups. Fig. S6 shows that in this scenario 
the crosstalk for important genes can be reduced, but only at a cost of increasing the crosstalk error 
for normal genes. Our theoretical framework can be extended easily to treat multiple heterogeneous
groups of genes.
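
The trade-off between the two groups can be checked with simple arithmetic on the numbers reported in Table 1 (10% important genes with error 0.1, the remaining genes with error 0.33). The sketch assumes that the overall crosstalk is the plain expected fraction of misregulated genes; the function name is hypothetical.

```python
def overall_crosstalk(fraction_important, x_important, x_normal):
    """Expected fraction of misregulated genes across the two groups."""
    return fraction_important * x_important + (1.0 - fraction_important) * x_normal


if __name__ == "__main__":
    # Values taken from the "Genes of unequal importance" row of Table 1.
    print(round(overall_crosstalk(0.10, 0.10, 0.33), 2))
```

Under that assumption the two per-group errors combine to about 0.31, higher than the 0.23 of the basic model, so protecting the important genes raises the overall error.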

Next, we examine the situation where each TF can regulate more than one target gene.
Specifically, the cell still contains M genes in total out of which Q need to be activated in each
environment; in contrast to the basic model, each TF now activates groups of Θ genes, which are assumed
to have identical binding sites (if the sites are not identical, one can show that the crosstalk only
worsens). In this case the achievable crosstalk is lower than in the basic model, as expected: the
regulatory network is trading off detailed control over individual genes for crosstalk improvement.
Surprisingly, however, the crosstalk X* decreases only by a factor of √Θ (Table 1; see SI Section 1.5),
making it unlikely that crosstalk constraints could be made negligible solely by implementing gene
regulation at a very coarse level.
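
A rough numerical check of this scaling is sketched below. It rests on an assumption made here for illustration only: that groups of Θ genes with identical binding sites behave as single regulatory units, so that the basic-model expression X* = (Q/M)(2√(S(M − Q)) − S(M − Q)) can be reused with M/Θ and Q/Θ in place of M and Q. The actual derivation is in SI Section 1.5; function names are hypothetical.

```python
import math


def minimal_crosstalk(S, Q, M):
    """Basic-model X* in the regulation regime."""
    s = S * (M - Q)
    return (Q / M) * (2.0 * math.sqrt(s) - s)


def minimal_crosstalk_shared_targets(S, Q, M, theta):
    """X* when each TF controls theta genes with identical sites.

    Assumption: the system maps onto the basic model with M / theta
    regulatory units, of which Q / theta must be active.
    """
    return minimal_crosstalk(S, Q / theta, M / theta)


if __name__ == "__main__":
    S, Q, M = math.exp(-10.5), 2500, 5000      # baseline parameters from the text
    basic = minimal_crosstalk(S, Q, M)
    shared = minimal_crosstalk_shared_targets(S, Q, M, theta=10)
    print(basic, shared, basic / shared, math.sqrt(10))
```

With Θ = 10 this gives a crosstalk of about 0.08, consistent with Table 1, and a reduction factor somewhat below √10 because of the subleading term in X*.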

Finally, we modify our basic model to use repression instead of activation to regulate target 
genes. In the basic model, the default state for each gene is to be inactive, with transcription
proceeding only when an activator is bound; in the modified model, the default state for each gene is
to be expressed, which can be prevented by binding of a repressor. We find a simple mathematical
relation between the crosstalk equations for the basic model and its repressor-only version (SI
Section 1.2), showing that the repressor-only case exhibits the same three regulatory regimes and
the same range of crosstalk values. One can also consider mixed models, where activation is used
for some genes and repression for the others. Unless symmetry between genes is broken such that
some genes need to be activated in more environments than other genes, crosstalk is minimized 
by pure strategies (using either only activators or only repressors); mixed strategies can become 
optimal when the symmetry is broken (see SI Section 3). 

Crosstalk is not easily mitigated by complex regulatory schemes 

So far we considered the simplest cis-regulatory element architecture with a single TF binding 
site. Most genes, especially in eukaryotes, employ complex regulatory elements with multiple 
TF binding sites, some of which have been suggested in the literature to increase the effective 
binding specificity of TFs or protect the binding sites from spurious binding [25,24]. By implication, 
such effects are expected to also reduce crosstalk. We next use our theoretical framework to study 
quantitatively under what conditions that may be the case. 

Cooperativity. We extend our basic model such that each gene is influenced by two nearby 
binding sites of length L to which cognate TFs can bind cooperatively. For simplicity we assume 
that cooperativity occurs between TFs of the same type, although the framework can be extended 
to more general cases. This molecular configuration of two cognate DNA-bound proteins is favored 
by an additional energy contribution A. We assume that only one of the two sites controls 
transcriptional activity directly (here, the site proximal to the gene start, e.g., by polymerase 
recruitment EZI), while the other (here, the distal site) helps stabilize the binding to the proximal 
site, as schematized in Fig. 4A. In this model, as A goes to zero, the distal binding site has no effect 
on regulation, and we recover the basic model of regulation by a single binding site (Fig. 3). 
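The limiting behavior described above can be illustrated with a minimal sketch that keeps only the cognate TF and ignores all noncognate binding. It assumes, for illustration only, that an empty site carries Boltzmann weight exp(-Ea), a cognate-bound site carries weight C (cognate binding has zero energy), and the doubly bound state gains a factor exp(A). The function names are hypothetical and not taken from the paper.

```python
import math

def proximal_occupancy(conc, e_a, coop):
    """Probability that the proximal site is bound by its cognate TF.

    Assumed (illustrative) weights, cognate TF only:
      conc : cognate TF concentration C, in units where a bound site has weight C
      e_a  : energy of the unbound state relative to cognate binding (k_BT)
      coop : cooperative energy A gained when both sites are occupied (k_BT)
    """
    empty = math.exp(-e_a)                  # weight of one unoccupied site
    w_none = empty * empty                  # both sites empty
    w_distal = conc * empty                 # only the distal site bound
    w_proximal = conc * empty               # only the proximal site bound
    w_both = conc * conc * math.exp(coop)   # both bound, stabilized by A
    z = w_none + w_distal + w_proximal + w_both
    # Transcription requires the proximal site to be occupied.
    return (w_proximal + w_both) / z

def single_site_occupancy(conc, e_a):
    """Occupancy of a single site by its cognate TF (same weights as above)."""
    return conc / (conc + math.exp(-e_a))

if __name__ == "__main__":
    c, e_a = 1e-7, 15.0
    for coop in (0.0, 3.0, 10.0):
        print(coop, proximal_occupancy(c, e_a, coop), single_site_occupancy(c, e_a))
```

With A = 0 the two-site partition function factorizes and the proximal occupancy reduces to the single-site value; for A > 0 the proximal site is occupied more often at the same concentration. Noncognate states, which determine the crosstalk itself, are deliberately left out of this sketch.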

To assess whether cooperative regulation can reduce crosstalk, we compute the minimal achievable 
crosstalk and compare it in Fig. 4B with the minimal crosstalk of the basic model, X*. We find 
that cooperativity can significantly reduce crosstalk in a large part of the "regulation 
regime," which itself extends towards larger S. Examining in detail how the crosstalk behaves in 
Fig. 4C, we see that at a fixed binding site length L, minimization of crosstalk prefers strong 
cooperativity A; nevertheless, the improvement in crosstalk is bounded and saturates at a limiting 
value as A grows. In this limit, crosstalk can approach and even drop below the crosstalk of the 
basic model with a binding site which is twice as long. This is a relevant comparison because 
cooperative regulation does, in fact, have access to a total of 2L base pairs of recognition sequence. 
Furthermore, the optimal TF concentration C* required in the cooperative case is lower than in the 
single site case (Fig. 4D), making cooperativity a realistic crosstalk reduction mechanism. 

The crucial assumption of the cooperative model presented above is that cooperative interaction 
between two TF molecules can only occur when they bind their cognate binding sites and never 
otherwise. This is a very restrictive assumption that is unlikely to hold in many documented models 
of cooperativity. For example, if cooperative interaction energy A originates in protein-protein 
interactions between the two TF molecules of the same species, this energy will plausibly be gained 
even when these same TF molecules come into contact while binding two nearby noncognate sites. 
Similarly, synergistic activation [24] or nucleosome-mediated cooperativity [45] models also imply 
that noncognately-bound factors could contribute towards cooperativity, violating our assumption 
that cooperativity is exclusive to cognate binding. 

To relax this assumption and study the effects of the resulting "noncognate cooperativity," we 
recompute accordingly the crosstalk improvement relative to the basic model, as shown in Fig. S19. 
Not surprisingly, we find that allowing cooperative interactions between TFs of the same type when 
bound noncognately leads to much smaller reductions in crosstalk compared to cooperativity that 
is exclusive to cognate binding, as shown in Table 1. When noncognate cooperativity is allowed, we 
can also look at the strong cooperativity (large A) limit and compare crosstalk improvement due 
to two TFs cooperatively binding two sites of length L, to the basic model of a single TF binding a 
site of length 2L. Now, cooperative regulation by two TFs is always inferior to the regulation by a 
single factor with a longer binding site (see SI Section 6). 

Dimerization of TFs is very common among prokaryotes, where TF monomers often dimerize 
in solution before binding to DNA. If the two binding sites in our model predominantly act 
as half-sites for the binding of a single dimer, the relevant equations for crosstalk are identical to 
noncognate cooperativity in the large A limit, with C being the concentration of monomers. Our 
theory is thus also applicable to this case, although dimerization in solution is often not considered 
a canonical example of cooperative regulation. Cooperative interactions conditional on DNA 
binding have been less frequently reported but are also known to occur in prokaryotes (e.g., on 
proximal binding of two dimers); in experimentally documented cases, the interaction energies are 
weaker, A ≈ 3 [27], which still facilitates crosstalk reduction, although the reduction is 
correspondingly smaller (Fig. S19). 

Figure 4: Cooperative regulation reduces crosstalk and the required optimal TF concentration. 
(A) Cognate binding configurations (noncognate not shown) for two sites of length L leading to 
transcription (green check) or not (red cross); doubly occupied promoter gains a cooperative 
energy A. Transcription proceeds only when the proximal (rightmost) site is occupied. (B) Difference 
in minimal crosstalk, shown in color, between the cooperative model and the basic model 
of Fig. 2, X*_coop − X*, for cooperative interaction strength A = 10. Cooperativity significantly 
reduces crosstalk (blue; at baseline parameters shown with white dashed lines, X*_coop = 0.006 here 
vs. X* = 0.23 in the basic model) and shrinks the "no regulation" (C* = 0) regime. (C) Minimal 
crosstalk error, X*, vs. binding site length L for different values of cooperative energy A shows 
that strong cooperativity can decrease the crosstalk beyond the basic model with binding site of 
length 2L (red). (D) Optimal TF concentration, C*, required to minimize crosstalk, decreases with 
increasing cooperativity A for all L, and is consistently below the single-site basic model with site 
length of either L (black) or even 2L (red). Circles denote transition to the "no regulation" (C* = 0) 
regime at low L (large S), showing that cooperativity extends the "regulation regime." In (C)-(D) 
we convert S values to the equivalent binding site lengths L utilizing the random sequence model. 

The two cases of cooperativity we considered here represent two extremes of a spectrum: 
cooperative interaction is either possible exclusively at the cognate site, or at all sites equally. There 
likely exist intermediate situations which help limit the occurrence of spurious cooperative 
interactions. A simple example of such a mechanism could utilize the positioning of the binding sites on 
the DNA: TF cooperative binding is limited only to pairs of sites which are appropriately spaced. If 
different TF types use different spacing, the harmful effects of cooperativity at a particular 
noncognate site pair will be restricted to a subset of TFs. More complex geometrical arrangements, e.g., 
cooperative interactions involving DNA looping or allosteric effects between the two TFs and the 
DNA [46], could provide similar benefits. While possible in principle, these benefits should be 
considered as hypothetical, since direct experimental support for cooperativity that is exclusive to 
cognate binding is still lacking. 



Figure 5: Combinatorial regulation by activators and repressors yields marginal improvements 
in crosstalk error. (A) Separate (left) or overlapping (right) binding sites for activators A and 
repressors R. A subset of binding configurations for cognate regulators is shown; transcription 
proceeds (green) only when the A site is bound by the cognate activator and the R site is unbound. 
(B,C) Difference (shown in color) between minimal crosstalk achievable with activator-repressor 
regulation and the basic model of Fig. 3. With optimal value for the affinity of repressor sites (Er) 
selected in both cases, a small overall improvement in crosstalk error is seen in (B), and a larger 
improvement, but localized to log(S) < −10, in (C). At baseline parameters (white dashed lines), 
X* = 0.2 for the non-overlapping case, X* = 0.15 for the overlapping case and X* = 0.23 in the 
basic model. (D) Dependence of the crosstalk on the repressor binding affinity Er (activator 
affinity fixed at Ea = 15). When Er > Ea, the crosstalk quickly increases: instead of helping prevent 
erroneous activation, repressors themselves bind too frequently in noncognate configurations, 
aggravating the crosstalk. For the non-overlapping sites scenario, Er ≪ Ea is optimal, whereas in the 
overlapping sites case, Er = Ea is optimal. (E) Dependence of crosstalk on the total concentration, 
C, of transcription factors, for the non-overlapping sites case (orange-brown curves representing 
different Er, as indicated) and the overlapping sites case (green curves representing different Er, as 
indicated). The total concentration is optimally split between activators and repressors for each C, 
and is reported relative to the optimal concentration C* of the basic model. 


Combinatorial regulation by activators and repressors. An important contribution to crosstalk 
is the erroneous activation of genes that should remain inactive. One might argue that any kind 
of global repression could alleviate this problem by preventing spurious transcription. We explored 
this scenario by extending our basic model to include an additional nonspecific repressor 
(SI Section 8). Perhaps not surprisingly, we find that the minimal achievable crosstalk error in this 
extended scheme is exactly the same as in the basic setup, regardless of the concentration and the 
affinity of the sites. 

We next turned our attention to a sequence-specific repression mechanism. In an extension to 
our basic model, we equipped each gene with both an activator and a repressor site, such that each 
of these sites has its own cognate regulator (activator or repressor). For the Q genes that should be 
active, only their Q cognate activators (but not repressors) were present. For the remaining M − Q 
genes that should be inactive, only their cognate repressors (but not activators) were present. 
Repressor sites could have a different affinity (Er) than the activator sites (Ea). To look for the minimal 
achievable crosstalk, we optimized over the concentration of activators, the concentration of 
repressors, and the affinity Er. Importantly, we considered two possible molecular arrangements on the 
promoter: in the non-overlapping sites scenario (Fig. 5A, left) the two binding sites could be occupied 
by regulatory molecules simultaneously, whereas in the overlapping sites scenario (Fig. 5A, right), 
either the activator site or the repressor site, but not both, could be occupied at any one time. Whether 
this exclusion happens because the two binding sites literally overlap or due to more complex 
mechanisms is not crucial for our results. We assumed that a bound repressor inactivates transcription, 
regardless of the activator state; for a detailed list of molecular configurations on the promoter, see 
SI Section 9. 

In the non-overlapping case, small (~ 10% at baseline parameters) decreases in crosstalk error 
are nominally possible, as shown in Fig. 5B. A detailed examination, however, argues against this 
mechanism for crosstalk reduction. Optimization in Fig. 5D assigns the repressor sites a very weak, 
or even vanishing, affinity for the TFs, Er ≪ Ea: in essence, the repressor sites energetically favor 
staying empty to the same amount as binding a cognate repressor, to fight off noncognate binding. 
As a costly consequence, the optimal concentration of the required TFs needs to be larger by 
an unreasonable factor, ≈ 20,000-fold, relative to the basic model, to achieve this small crosstalk 
reduction gain. 

The overlapping case provides a greater crosstalk reduction (≈ 35% at baseline parameters), as 
shown in Fig. 5C. The optimal repressor sites have similar affinity to their cognate TFs as do the 
optimal activator sites, Er ≈ Ea; the benefit of the repressors quickly vanishes if this condition is 
not met. The total required regulator concentration now no longer has a clearly defined optimum, 
but does exhibit a plateau where the crosstalk is minimized. Importantly, as shown in Fig. 5E, this 
plateau is reached for concentrations only somewhat higher than in the baseline case, making this 
solution biologically plausible. 

In sum, the case for combinatorial regulation by activators and repressors is complicated. 
Combinatorial regulation provides a smaller absolute improvement than cooperativity, but this 
improvement is also centered around smaller values for binding site similarity, log(S) < −10, where 
the crosstalk of the basic model is itself already lower. In contrast to our initial expectation, this 
small gain is realistically achievable only with one of the two regulatory schemes considered, and 
only when its parameters are properly tuned. 

AND-gate combinatorial regulation. Lastly, we considered the simplest AND-gate regulation 
scenario. The expression state of each gene is determined by the occupancy of two binding sites; 
in particular, activation is achieved by binding of a precisely specified, unique pair of cognate 
activating TFs. Crucially, in the "perfect combinatorial regulation" scenario, √(2M) TF species (instead 
of M, as in the basic model) are sufficient to specifically regulate any subset of the M genes. As 
we show in SI Section 7 and summarize in Table 1, this leads to a sizeable crosstalk reduction. 
Using √(2M) TF species means, on average, Θ = M/√(2M) regulated genes per TF. If sets of Θ genes 
were regulated jointly by a common TF, crosstalk should decrease as ~ √Θ, as we argued above. 
Figure S21 shows that for the AND-gate the decrease is somewhat smaller, but unlike in the simple 
scenario where each TF regulates groups of Θ genes with no possibility of control over individual 
genes, the AND-gate allows each gene to be regulated individually. While this combinatorial 
strategy allows crosstalk reduction and has been documented at specific promoters, we point out that 
the predicted square-root scaling of the number of TF species with the total number of genes, M, is 
inconsistent with published reports [47,48], making it unlikely that crosstalk reduction is achieved 
through genome-scale combinatorial control as analyzed here. 
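The square-root count quoted above can be checked with a short sketch. It assumes, as an illustration, that each gene is addressed by an unordered pair of two distinct TF species, so n species provide n(n-1)/2 distinct pairs; the helper name is hypothetical. The quantity M/√(2M) is evaluated exactly as written in the text.

```python
import math

def min_tf_species_for_pairs(num_genes):
    """Smallest n such that n*(n-1)/2 unordered pairs of distinct TFs cover num_genes."""
    n = math.isqrt(2 * num_genes)          # floor(sqrt(2M)) never exceeds the answer
    while n * (n - 1) // 2 < num_genes:
        n += 1
    return n

if __name__ == "__main__":
    for m in (5000, 20000):
        n_exact = min_tf_species_for_pairs(m)
        n_approx = math.sqrt(2 * m)         # the sqrt(2M) estimate used in the text
        theta = m / n_approx                # M / sqrt(2M), as written in the text
        print(m, n_exact, n_approx, theta)
```

For M in the thousands the exact pair count and the √(2M) estimate differ by only about one species, which is why the approximation is adequate for the scaling argument.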


Discussion 

Finite specificity of recognition reactions is a fact of life at the molecular scale. In transcriptional 
regulation, which takes place in a mix of cognate and noncognate transcription factor species, the 
consequences of this fact could be severe, but have surprisingly not been taken to their logical 
conclusion so far. Here, we constructed a theoretical framework for crosstalk that accounts for all 
possible cross interactions between regulators and their binding sites. This global model enabled 
us to compute the lower bound on crosstalk and assess the effectiveness of various regulatory 
schemes. We derived limits to reliable gene regulation that depend only on the total number of 
genes, M, the typical number of co-activated genes, Q, and the average level of similarity between 
pairs of binding sites, S. 

We find that these parameters robustly define three possible regulatory regimes. A nonzero TF 
concentration that minimizes crosstalk exists only when binding sites are sufficiently distinguishable 
from each other and the typical number of co-activated genes is not extreme. We call this 
the "regulation regime." The other two regimes are anomalous cases where regulation is 
dysfunctional. Looking closely at the boundaries between the three regulatory regimes, we find that the 
average similarity between binding sites, S, puts an upper bound to the total number of genes that 
an organism can effectively regulate [49]. 

An analogous problem exists in protein-protein interaction networks, where protein function 
requires strong binding to a few partner-proteins but avoidance of binding to all the others |IZl|6l. 
Previous works have studied the evolution of such networks by applying a combination of positive 
and negative design using computer simulations, concluding that "negative design" seriously 
constrains the possible architectures IBUIISTI1521 151. As a quantitative measure for the likelihood of 
specific vs. nonspecific interactions, Johnson et al. used the minimal energy gap between specific 
and nonspecific interactions, in analogy to our measure of binding site similarity S. They found 
a power-law scaling of the energy gap with the total number of proteins in the network and also 
found that it depends inversely on the size of the binding surface, L; both results are in qualitative 
agreement with ours for the total number of genes M and length of the binding sites L. Similarly, 
a larger binding domain was found to enable a larger number of specific interactions in a protein 
mixture when other nonspecific interactions are excluded IISTlI. Johnson et al. also found that 
network designs in which some proteins have multiple specific partners ("hubs") have higher crosstalk 
compared to networks with only pairwise interactions. At this point protein-protein interaction 
networks significantly differ from TF-DNA interactions: if multiple binding sites share a common 
TF, these binding sites cannot bind each other, as would be the case for different protein species 
interacting with a common hub. Zhang et al. identified a trade-off between proteome diversity and 
concentration due to crosstalk considerations, concluding that the numbers found experimentally 
are close to the possible limit HZ). Protein concentrations face a trade-off: they should be high enough 
to form specific interactions, but not so high as to form many nonspecific ones. The optimal TF 
concentration in our model is determined by a similar trade-off. Analogous problems due to the 
explosion of noncognate configurations were studied in the context of prebiotic metabolism [53] and 
the immune system, where receptors are selected to recognize foreign peptides but avoid binding 
self-peptides [54]. In the context of TF-DNA interactions, Sengupta et al. [21] studied how the 
evolutionary mutation-selection balance tunes the specificity of a TF to its DNA targets and how this 
depends on the number of targets. They identified a trade-off between avoiding the loss of current 
targets (for which a lower specificity is favored) and avoiding the spurious recruitment of new ones 
(for which a higher specificity is favored); they also report an inverse relation between the number 
of different targets and the TF specificity for each. An intriguing direction for future research is to 
explore how crosstalk might limit the complexity of regulatory networks in an evolutionary setting. 

Table 2: Comparison of relevant parameters and crosstalk values between prokaryotes and 
eukaryotes. 

                                        prokaryotes            eukaryotes 
binding site length                     10-20 bp               6-10 bp 
binding site similarity, S              -20 < log(S) < -13     -15 < log(S) < -9 
number of genes, M                      a few thousand         5,000 - 20,000 
crosstalk in the basic model            1% - 10%               20% - 50% (depending on M) 
crosstalk with cooperative regulation   < 1%                   ~ 10% 


Where do real organisms find themselves in this parameter space? Prokaryotes tend to have 
longer binding sites and fewer genes than eukaryotes. In Table 2 we present typical biophysical 
parameters for each and the resulting crosstalk estimates. While for prokaryotes we expect crosstalk 
to easily be between 1% and 10% even if each gene is regulated by a single site, and below 1% for 
biophysically realistic cooperative regulation, for eukaryotes the situation is significantly different. 
Even for a short genome of M = 5,000 genes, such as yeast, or for longer genomes of metazoans 
where most of the genes have been non-transcriptionally silenced, we expect minimal crosstalk of 
X* = 0.23. In an organism with M = 20,000 regulated genes crosstalk would increase substantially 
according to the basic model: more than 40% of all genes would be erroneously regulated. 
Incorporating known constraints on the biophysics of TF-DNA interaction (Figs. S16, S17) increases 
crosstalk even further and pushes metazoan regulation towards the anomalous regime. 

Complex regulatory schemes increase the specificity of gene regulation by cognate factors, and 
high specificity was tacitly assumed to provide automatic resilience against crosstalk. In contrast, 
our analysis of several complex regulatory mechanisms reveals a more intricate picture. We focused 
on two broad classes of regulatory mechanisms. The first class comprises various schemes 
of cooperative regulation. Cooperativity can lower crosstalk because it effectively increases the 
binding site length and energy and thus reduces binding site similarity. We found that the 
effectiveness of cooperativity for reducing crosstalk crucially depends on the strength of the cooperative 
interaction and on whether cooperative interactions are restricted exclusively to cognate sites. 
With respect to cooperative interaction strength, the optimal crosstalk reduction happens at very 
strong cooperativity, but this might be hard to realize biophysically. Commonly reported values 
are indeed small (3-5 k_BT), comparable to the energetic contribution of only 1-2 bp in the 
TF-DNA interaction [27,28]. With respect to cooperative interactions being exclusive to cognate 
binding, such regulatory schemes, while optimal for crosstalk reduction, would require additional 
sequence recognition mechanisms, and it is unclear to what extent they exist or how effective they 
are. If cooperative interactions can occur at noncognate sites as well, as is the case for most 
documented mechanisms of cooperativity, its effectiveness in mitigating crosstalk is significantly 
diminished. The second class of mechanisms we considered relies on combinatorial regulation by 
multiple TFs. As a representative example we studied combinatorial regulation by activators and 
repressors. Contrary to the common expectation that repression should eliminate spurious gene 
activation ESI ED, we found various mechanisms to be either ineffective (global repression) or 
providing marginal global improvement at best (activator-repressor regulation with overlapping 
binding sites). While crosstalk can indeed be mitigated for particular gene(s) by employing a complex 
promoter architecture, this inevitably comes at a cost for the regulation of other genes. The 
intuitive explanation for the limited benefit of combinatorial schemes is that adding new regulatory 
components (in this case, repressors and their respective binding sites) drastically increases the 
number of possible noncognate interactions, thereby potentially aggravating, instead of mitigating, 
the crosstalk problem. A similar detrimental effect due to growth in the number of undesired 
configurations with the number of molecular species has been reported in the study of molecular 
self-assembly [12]. A potentially powerful set of mechanisms are therefore schemes in which 
combinatorial regulation is used primarily to decrease the required number of molecular species, as in 
the simple AND-gate example we explored in SI Section 7. Further work is needed to fully elucidate 
crosstalk limits in more general models of combinatorial control and cooperativity, with 
interesting parallels to precision in biochemical sensing, in equilibrium as well as out-of-equilibrium 
scenarios II551E1I561E21. 

An interesting result of our study is that various schemes of molecular control logic at promoters 
and enhancers m, while nearly equivalent in the absence of crosstalk, can behave very differently 
in the presence of noncognate regulators [58]. For example, the issue of cooperative interactions 
during noncognate binding is a striking demonstration of how a seemingly microscopic detail may 
influence global crosstalk, while it has no bearing on the aspects for which cooperativity has been 
studied traditionally: its ability to sharply activate the cognate gene in response to small increases 
in TF concentration. A similar remark applies for the case of overlapping vs. nonoverlapping binding 
sites in the combinatorial regulation scenario. By going beyond mean-field approximations, 
this could be extended to biologically relevant situations where pairs of binding sites overlap so as 
to share large sequence fragments [59]. Clearly, there is a need to further understand signal 
processing at complex promoters [60] and to measure crosstalk experimentally in various 
regulatory architectures. 

Direct measurements of crosstalk are challenging precisely because crosstalk is a global effect 
and experimentally influencing noncognate binding in a controlled manner is difficult. An alternative 
approach would be to search for indirect signatures of crosstalk IbTl. A promising line of research 
supported by a large body of recent experimental evidence would be to examine "pervasive 
transcription" in eukaryotes flSl l62l as a proxy for erroneous initiation, perhaps due to crosstalk 
interference. 

Taken together, our findings suggest that global crosstalk represents a strong constraint in 
eukaryotic regulation that can be mitigated, but not easily removed. Initially, this conclusion was 
based on a greatly simplified model of gene regulation. We succeeded in relaxing many of our 
assumptions only to find that crosstalk constraints remain significant. This is because the major 
determinant of crosstalk is the binding site similarity S, which primarily depends on the typical 
mismatch energy ε and the length of the binding sites, L. While crosstalk could be reduced by 
extending binding site length and/or augmenting the binding energy, both parameters are severely 
constrained by a combination of biophysical and evolutionary factors. The scale of the mismatch 
energy is set by the energetics of hydrogen bonds to ~ 2-4 k_BT, while the length of individual 
binding sites in eukaryotes appears strongly constrained by evolutionary considerations to ≈ 10 
bp EH Elisa. Moreover, the performance of complex regulatory schemes, which appear 
beneficial at first glance, is also limited by the explosion of possible noncognate configurations that 
may lead to erroneous regulation. These constraints should apply universally, beyond the specific 
mechanisms we analyzed in detail: any regulatory scheme operating at equilibrium, no matter how 
complex, faces a fundamental limit to its achievable error, for reasons that led Hopfield to propose 
kinetic proofreading [2]. 
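To make the dependence of S on the mismatch energy and site length concrete, the sketch below evaluates one possible random-sequence estimate. It is an illustration under stated assumptions, not the definition used in the paper: S is taken to be the average Boltzmann weight exp(-ε d) of a noncognate site, each of the L positions mismatches independently with probability 3/4 (four equiprobable nucleotides), and the logarithm is natural. The function name is hypothetical.

```python
import math

def log_similarity(length, eps):
    """Natural log of S = <exp(-eps * d)> with d ~ Binomial(length, 3/4).

    Assumption (illustrative): positions are independent, so the average
    factorizes into (1/4 + 3/4 * exp(-eps)) ** length.
    """
    per_position = 0.25 + 0.75 * math.exp(-eps)
    return length * math.log(per_position)

if __name__ == "__main__":
    for length in (6, 10, 20):
        for eps in (2.0, 3.0, 4.0):
            print(length, eps, round(log_similarity(length, eps), 2))
```

Under this assumption log S is linear in L but saturates in ε: even for very large mismatch energy each position contributes at most log(1/4), so lengthening the site is the only way to keep lowering S.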

The main conclusion of our work is that crosstalk in gene regulation is far from being a solved 
problem. We find several commonly studied regulatory mechanisms to be insufficient for 
eliminating crosstalk in metazoans, at least when acting alone. While it is theoretically possible that a 
combination of equilibrium mechanisms acting in unison could achieve low crosstalk levels, this 
possibility is by no means obvious and indeed appears unlikely. Alternatively, cells might have 
evolved out-of-equilibrium solutions where energy is deliberately spent to counteract the 
detrimental effects of crosstalk; example mechanisms could include permanent gene silencing, 
localization of transcriptional activity to specific cellular compartments, or molecular reaction schemes for 
gene regulation that implement variants of kinetic proofreading [29]. 

1 Basic model - analytical solution 

We assume that the genome of a cell contains M "target" genes, each of which is regulated by a 
single unique transcription factor binding site (BS). In the basic formulation, there exist also M 
distinct TF types, such that each TF can preferentially activate its corresponding target gene by 
binding to its binding site. At any point in time, however, not all M TF types are present: we 
assume that only subsets of size Q < M are present at some nonzero concentration, and that the 
optimal gene regulatory state for the cell would be to express exactly and only those genes for 
which the Q corresponding TFs are present. 

Let regulation be determined by the (mis)match between the binding site sequence and the 
recognition sequence of any transcription factor. Each binding site is associated with a single TF 
type with which it forms a perfect match; this is the cognate TF for the given binding site. However, 
each site could also occasionally be bound by other (noncognate) TFs, at an energetic cost of 
a certain number of mismatches. Following earlier works [38,20], we assume that the contribution 
of mismatches at individual positions in a binding site to the binding energy is equal, additive, 
and independent. We define the energy scale such that binding with the cognate TF has zero energy 
and all other binding configurations have positive energies, proportional to the number of 
mismatches d, E = εd, where ε is the per-nucleotide binding energy. The unbound state has energy 
Ea with respect to the cognate bound state. The different states and their energies are illustrated in 
Fig. 3A in the main text. We employ a thermodynamic model to calculate the equilibrium binding 
probabilities of cognate and noncognate factors to each binding sequence. 

TFs can also be non-specifically bound to the DNA. These configurations only sequester TFs 
from free solution, but do not directly interfere with gene expression. As explained later, we will 
lump together the TFs freely diffusing in the solution, as well as nonspecifically bound TFs and any 
other TF "reservoirs" into one effective concentration of available TFs (equivalently, we work with 
the chemical potential of the available TFs using the grand-canonical ensemble). 

Previous studies calculated the probability of a given transcription factor to be bound or unbound 
to certain DNA sequences [20]. These probabilities were calculated assuming that the site is 
vacant or bound by the TF under study, but not bound by TFs of other types. This approach is 
cumbersome when a large number of TF types are considered simultaneously, because the probability 
that the site is bound by other factors is non-negligible, and due to steric hindrance, a site cannot be 
bound by more than one molecule at any given time. Previous studies also proceeded by using the 
canonical ensemble. These two modeling choices together make the problem of many TFs binding 
to multiple binding sites coupled and not easily tractable, because one would need to enumerate 
all possible combinations of TF-BS states. However, an alternative and much simpler approach is 
to employ the grand-canonical ensemble, and calculate the binding probabilities for the binding 
sites, rather than for the TFs. The necessary assumption is that binding sites behave independently 
(e.g., they are sufficiently separated on the DNA so that binding at one site does not overlap the 
binding at another, or if it does, this is treated explicitly). Underlying the grand-canonical ensemble 
is the assumption that TFs are present at sufficient copy numbers, so that the binding of a single 
site under consideration does not appreciably affect the chemical potential of the remaining TFs. 
Experimental support for such decoupling and the applicability of the grand-canonical approach 
has been demonstrated recently 1^. In the following we assume equal concentrations of all TF 
types. 

We distinguish two contributions to crosstalk: 


1. For a gene i that should be active and whose cognate TF is therefore present, an error occurs if its 
binding site is bound by a noncognate regulator (activation out of context due to crosstalk), 
or if the binding site is unbound (gene is inactive). This happens with probability 

\[
x_1^i = \frac{e^{-E_a} + \sum_{j \neq i} C_j e^{-\epsilon d_{ij}}}{C_i + e^{-E_a} + \sum_{j \neq i} C_j e^{-\epsilon d_{ij}}}, \tag{S6}
\]

where C_j is the concentration of the jth TF, d_ij is the number of mismatches between the jth 
TF consensus sequence and the binding site of gene i, ε is the energy per mismatch, and E_a is the 
energy difference between unbound and cognate bound states; all energies are measured in 
units of k_BT. 


2. For a gene i that should be inactive and whose cognate TF is therefore absent, a crosstalk error 
occurs only if its binding site is bound by a noncognate regulator (erroneous activation) 
rather than remaining unbound. This happens with probability 

\[
x_2^i = \frac{\sum_j C_j e^{-\epsilon d_{ij}}}{e^{-E_a} + \sum_j C_j e^{-\epsilon d_{ij}}}. \tag{S7}
\]

In general, x_1 and x_2 depend on the specific set of pairwise distances d_ij between the consensus 
sequence of each TF present and the site of gene i. Hence they could vary between genes, and even 
for a single gene different sets of TFs can yield different values of crosstalk. In the following we 
assume a fully symmetric setup, such that all genes are equivalent in their sensitivity to crosstalk 
(x_{1,2} is independent of i). We assume that for each gene the mismatches d_ij of all the noncognate TFs 
are distributed according to a probability distribution P(d) (independent of the gene). For a particular 
gene i, different sets of TFs clearly provide different pairwise distances d_ij. However, for Q ≫ 1 the 
fraction of sets of the same size Q that yield distances distributed very differently from P(d) 
is small. In the following we neglect this fraction and assume that all choices of Q TFs yield exactly 
the same crosstalk contribution x_{1,2}(Q, M); this mean-field assumption is explicitly validated by 
numerical simulations in SI Section 2. We also assume that all possible sets of Q TFs (sets of 
genes that need to be active) are equally likely to occur. 

See SI Section 4 for alternative definitions of x_1 and x_2. 

Our next step is to calculate total crosstalk as a function of the above parameters (the total 
number of binding sites M and the number of TF types available at any given time Q). We define 
total crosstalk as the fraction of genes found in any of the possible erroneous states. We assume 
that the particular choice of Q TFs that are present is random (hence we average over all possible 
ways to choose Q out of M TFs). In reality only certain sets of TFs need to be active together, in 
which case the genes that are co-activated could have mutually similar binding sites, especially if 
they were regulated by the same TF, compared to genes that are activated separately, possibly by 
different TFs. In SI Section 1.5 we treat a simple extension of our model where each TF can 
co-regulate several target genes. We also assume equivalence between the two types of error (we relax 
this assumption below in SI Section 1.3). 

If each of the Q genes that should be active has probability x_1 to be in any of the crosstalk 
states, then the expected number of genes in that state is Q x_1. Similarly, of the genes that should be 
inactive, the expected number in a crosstalk state is (M - Q) x_2. To obtain the fraction of genes 
in any of the crosstalk states we simply divide by the total number of genes M: 

\[
X(Q, M, x_1, x_2) = \frac{Q}{M} x_1 + \frac{M - Q}{M} x_2. \tag{S8}
\]

Using the definition of S introduced in the main text, 

\[
\sum_{j \neq i} C_j e^{-\epsilon d_{ij}} = (Q - 1) \sum_d P(d) e^{-\epsilon d} \frac{C}{Q} \approx C \sum_d P(d) e^{-\epsilon d} = C S(\epsilon, L), \tag{S9}
\]

where C is the total TF concentration (each TF type that is present has concentration C/Q) and 
we approximated Q - 1 ≈ Q, which is valid for Q ≫ 1 (an assumption we make here and 
throughout the paper). S(ε, L) is an average similarity measure between all pairs of binding sites. 
If binding site sequences are drawn randomly from a uniform distribution, S = (1/4 + (3/4) e^{-ε})^L. This 
is easy to derive: since individual base pairs are assumed to be statistically independent, at each 
position the probability of a random sequence to be identical to a given TF consensus sequence is 
1/4, whereas with probability 3/4 it is different, which multiplies the Boltzmann weight by e^{-ε}. Since 
the complete binding site consists of L independent base pairs, this expression for a single base pair 
is raised to the power of L. 
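
The closed form for S can be checked against the defining sum over mismatches. The sketch below is only an illustration: it assumes uniformly random sequences, so that the number of mismatches d follows a binomial distribution with L trials and mismatch probability 3/4. The function names and the values of ε and L are hypothetical choices made for this example.

```python
import math

def similarity_closed_form(eps, L):
    """S = (1/4 + 3/4 * exp(-eps))**L for uniformly random binding sites."""
    return (0.25 + 0.75 * math.exp(-eps)) ** L

def similarity_from_mismatch_distribution(eps, L):
    """S = sum_d P(d) * exp(-eps * d), with P(d) binomial(L, 3/4)."""
    total = 0.0
    for d in range(L + 1):
        p_d = math.comb(L, d) * (0.75 ** d) * (0.25 ** (L - d))
        total += p_d * math.exp(-eps * d)
    return total

# Illustrative parameter values (assumptions, not taken from the text).
eps, L = 2.0, 10
S_closed = similarity_closed_form(eps, L)
S_sum = similarity_from_mismatch_distribution(eps, L)
print(S_closed, S_sum)
```

Both routes should agree to floating-point precision, because the binomial sum factorizes over the L independent positions.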

The expressions for x_1 and x_2 then read: 

\[
x_1 = \frac{e^{-E_a} + CS}{C/Q + e^{-E_a} + CS}, \tag{S10a}
\]
\[
x_2 = \frac{CS}{e^{-E_a} + CS}. \tag{S10b}
\]

The two extreme cases occur when TF concentrations are either zero or very large (Table 3). If 
C = 0, then x_1 = 1 and x_2 = 0: x_1 is maximal because all binding sites that should be bound are 
unbound, while x_2 vanishes because all binding sites that should be unbound are indeed unbound. 
The total error then amounts to the fraction of genes that need to be activated, X(C = 0) = Q/M. 
At the other extreme, if C → ∞, x_1 = SQ/(1 + SQ) and x_2 = 1, i.e., no site is left unbound. The 
magnitude of the x_1 error due to noncognate binding is determined by the binding site similarity S. 
If QS ≪ 1, x_1 ≈ QS - (QS)^2. The total crosstalk then amounts to X(C → ∞) = 1 - (Q/M)/(1 + SQ). 
If SQ ≪ 1, X ≈ 1 - (Q/M)(1 - SQ). 
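
The two limits can be reproduced numerically from Eqs. (S8) and (S10). The following sketch is illustrative only: the function names and the parameter values (M, Q, S, E_a) are assumptions chosen for the example, and a very large finite C stands in for C → ∞.

```python
import math

def x1(C, Q, S, Ea):
    """Per-gene error for genes that should be active, Eq. (S10a)."""
    a = math.exp(-Ea)
    return (a + C * S) / (C / Q + a + C * S)

def x2(C, S, Ea):
    """Per-gene error for genes that should be inactive, Eq. (S10b)."""
    a = math.exp(-Ea)
    return C * S / (a + C * S)

def total_crosstalk(C, Q, M, S, Ea):
    """Total crosstalk X, Eq. (S8)."""
    return (Q / M) * x1(C, Q, S, Ea) + ((M - Q) / M) * x2(C, S, Ea)

# Illustrative parameter values (assumptions, not taken from the text).
M, Q, S, Ea = 5000, 2500, 1e-5, 15.0

X_zero = total_crosstalk(0.0, Q, M, S, Ea)     # should equal Q/M
X_large = total_crosstalk(1e12, Q, M, S, Ea)   # stands in for C -> infinity
X_inf = 1.0 - (Q / M) / (1.0 + S * Q)          # analytical limit
print(X_zero, X_large, X_inf)
```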

Next, we analyze the dependence of crosstalk on various parameters. One unknown in these 
expressions is the TF concentration C. Because we are searching for a lower bound on crosstalk, we 
can find the concentration that minimizes X. Taking the derivative of X and solving for its zeros, 

\[
\frac{d}{dC} X(Q, M, x_1, x_2) = 0,
\]

we find two potential extrema, 

\[
C^*_{1,2} = \frac{Q e^{-E_a} \left( S(SMQ - Q(SQ + 2) + M) \pm \sqrt{S(M - Q)} \right)}{S \left( -M(SQ + 1)^2 + SQ^2(SQ + 3) + Q \right)},
\]

but only one of them can yield non-negative concentration values (and is consistently a minimum): 

\[
C^* = \frac{Q e^{-E_a} \left( S(SMQ - Q(SQ + 2) + M) - \sqrt{S(M - Q)} \right)}{S \left( -M(SQ + 1)^2 + SQ^2(SQ + 3) + Q \right)}. \tag{S11}
\]


                                           | x_1                                   | x_2                | crosstalk, X
                                           | (e^{-E_a} + CS)/(C/Q + e^{-E_a} + CS) | CS/(e^{-E_a} + CS) | (Q/M) x_1 + ((M - Q)/M) x_2
C = 0                                      | 1                                     | 0                  | Q/M
C = ∞                                      | SQ/(1 + SQ)                           | 1                  | 1 - (Q/M)/(1 + SQ)
optimal C; only activators                 | (1 + QZ)/(1 + Z/S + QZ)               | QZ/(1 + QZ)        | (Q/M)(1 + QZ)/(1 + Z/S + QZ) + ((M - Q)/M) QZ/(1 + QZ)
optimal C; activators and global repressor | (1 + QZ)/(1 + Z/S + QZ)               | QZ/(1 + QZ)        | (Q/M)(1 + QZ)/(1 + Z/S + QZ) + ((M - Q)/M) QZ/(1 + QZ)

Table 3: Crosstalk errors in the basic model. Per-gene errors of the two types: x_1 is the error of 
a site whose cognate TF exists and the site should therefore be bound, but is either unbound or 
bound by a noncognate factor. x_2 is the error of a site whose cognate factor does not exist, and the 
site should therefore be unbound, but is bound by a noncognate factor. The last column shows the 
total crosstalk, averaged over all M sites. 
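
As a numerical sanity check of the closed-form optimum in Eq. (S11), one can minimize X(C) directly and compare. The sketch below is a minimal illustration under stated assumptions: the parameter values are hypothetical and chosen to lie in the regime where a finite optimum exists, and the golden-section search over log C is simply a convenient generic minimizer, not a method used in the text.

```python
import math

def total_crosstalk(C, Q, M, S, Ea):
    """Total crosstalk X of Eq. (S8) with x_1, x_2 from Eqs. (S10)."""
    a = math.exp(-Ea)
    x1 = (a + C * S) / (C / Q + a + C * S)
    x2 = C * S / (a + C * S)
    return (Q / M) * x1 + ((M - Q) / M) * x2

def optimal_concentration(Q, M, S, Ea):
    """Closed-form C* of Eq. (S11); meaningful only when 0 < C* < infinity."""
    root = math.sqrt(S * (M - Q))
    numerator = Q * math.exp(-Ea) * (S * (S * M * Q - Q * (S * Q + 2) + M) - root)
    denominator = S * (-M * (S * Q + 1) ** 2 + S * Q ** 2 * (S * Q + 3) + Q)
    return numerator / denominator

def golden_section_argmin(f, lo, hi, iterations=200):
    """Generic golden-section search for the minimizer of a unimodal f on [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    for _ in range(iterations):
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) < f(d):
            b = d
        else:
            a = c
    return 0.5 * (a + b)

# Illustrative parameter values (assumptions, not taken from the text).
M, Q, S, Ea = 5000, 2500, 1e-5, 15.0

C_star = optimal_concentration(Q, M, S, Ea)
log_C_numeric = golden_section_argmin(
    lambda u: total_crosstalk(math.exp(u), Q, M, S, Ea),
    math.log(1e-8), math.log(1e3))
C_numeric = math.exp(log_C_numeric)

# Minimal crosstalk expected from (Q/M) * (2*sqrt(S(M-Q)) - S(M-Q)).
X_star_formula = (Q / M) * (-S * (M - Q) + 2.0 * math.sqrt(S * (M - Q)))
print(C_star, C_numeric, total_crosstalk(C_star, Q, M, S, Ea), X_star_formula)
```

Searching in log C rather than C keeps the search well conditioned when the optimum is many orders of magnitude below the upper bracket.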


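For small S the leading terms in the optimal concentration are 

\[
C^* \approx \frac{e^{-E_a} Q}{\sqrt{S(M - Q)}} - \frac{e^{-E_a} Q(M - 2Q)}{M - Q} - \frac{e^{-E_a} Q^2 (2M - 3Q) \sqrt{S}}{(M - Q)^{3/2}} + O(S). \tag{S12}
\]

Substituting Eq. (S11) back into Eq. (S8) yields the minimal achievable crosstalk: 

\[
X^* = \frac{Q}{M} \left( -S(M - Q) + 2\sqrt{S(M - Q)} \right). \tag{S13}
\]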

For a constant number of co-activated genes Q, X* increases to leading order like the square 
root of S, 

\[
X^* = \frac{2Q\sqrt{M - Q}}{M} \sqrt{S} + O(S). \tag{S14}
\]

Substituting C* into the single-gene crosstalk expressions, Eqs. (S6)-(S7), we obtain the minimal 
per-gene crosstalk 

\[
x_1^* = \sqrt{S(M - Q)}, \tag{S15a}
\]
\[
x_2^* = SQ \left( \frac{1}{\sqrt{S(M - Q)}} - 1 \right). \tag{S15b}
\]

Since crosstalk must be in the range [0, 1] and M > Q, this solution is only valid under the 
condition that S(M - Q) < 1. Thus, minimal crosstalk has 3 regimes: 

1. For S > 1/(M - Q), crosstalk is minimized by taking C = 0. This is the "no regulation" 
regime. In this case, crosstalk amounts to Q/M, which is simply the fraction of genes that 
were supposed to be activated (but are not, due to the lack of their TFs). 

2. For Q > Q_max(S, M), crosstalk is minimized by taking C → ∞; this is the "constitutive 
regime." Q_max(S, M) is given by two of the roots of the 4th order equation S(M + SMQ - 
2Q - SQ^2) - √(S(M - Q)) = 0, solved for Q. We find the boundaries between the 3 different 
regulatory regimes by solving for C*(S, M, Q) = 0. 

3. Otherwise, there is an optimal concentration 0 < C* < ∞, given by Eq. (S11), that minimizes 
crosstalk; this is the "regulation regime." 
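
The three regimes listed above can be sketched as a small classifier. This is an illustration under explicit assumptions: the test for the constitutive regime uses the sign of r(1 + SQ) - SQ with r = √(S(M - Q)), which is our own algebraic simplification of where the denominator of Eq. (S11) changes sign and is not a formula quoted from the text. Function names and parameter values are hypothetical.

```python
import math

def classify_regime(Q, M, S):
    """Return which of the three regimes minimizes crosstalk (basic model)."""
    r = math.sqrt(S * (M - Q))
    if r >= 1.0:
        return "no regulation"        # S >= 1/(M - Q): optimum at C = 0
    # Assumed divergence test (our simplification of Eq. (S11)):
    # C* is finite and positive only while r*(1 + S*Q) - S*Q > 0.
    if r * (1.0 + S * Q) - S * Q <= 0.0:
        return "constitutive"         # optimum at C -> infinity
    return "regulation"               # finite optimum 0 < C* < infinity

def minimal_crosstalk(Q, M, S):
    """Minimal crosstalk in each regime, using the limits and the regulation-regime formula."""
    regime = classify_regime(Q, M, S)
    if regime == "no regulation":
        return Q / M
    if regime == "constitutive":
        return 1.0 - (Q / M) / (1.0 + S * Q)
    r = math.sqrt(S * (M - Q))
    return (Q / M) * (2.0 * r - r * r)

# Illustrative parameter values (assumptions, not taken from the text).
for Q, S in [(2500, 1e-5), (2500, 1e-3), (4500, 5e-4)]:
    print(Q, S, classify_regime(Q, 5000, S), minimal_crosstalk(Q, 5000, S))
```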

The boundary between the first and third region is at S* = 1/(M - Q), and the boundary between the 
second and the third is at S* = (3Q - 2M ± √(Q(5Q - 4M)))/(2Q(M - Q)). Hence, the second region 
(where C* = ∞) only applies for Q > 4M/5. Fig. S7(b) illustrates the dependence of the TF 
concentration C*, which minimizes crosstalk, on the number of co-activated genes Q. It demonstrates 
how the range in which 0 < C* < ∞ gets narrower when S increases. Fig. S6 shows crosstalk and C* 
values for M = 20,000 (compare to Fig. 3 in the main text with M = 5000). 



Figure S6: Crosstalk in the basic model for M = 20,000. Panel (a) shows the minimal crosstalk, 
X*; panel (b) shows the optimal TF concentration, C*. These results are analogous to Fig. 3 of the 
main paper, which is computed for M = 5000. The results for the two different M are qualitatively 
similar and show 3 different regimes of regulation. We make the following observations: (i) for 
larger M, the C* = 0 regime expands to include lower S values, as expected from the analytical 
solution for the regime boundaries; (ii) if the fraction of co-activated genes, Q/M, remains constant, 
the crosstalk increases with M, as it also depends on the absolute number of inactive genes M - Q 
(see Eq. (S13)). The discrepancies at small Q between the black solid curve separating the "no 
regulation" and "regulation" regimes and the numerically computed C* values are due to the 
approximation Q - 1 ≈ Q. 


1.1 Basic model: Dependence on variables 

1.1.1 Dependence on TF concentration 

The optimal TF concentration C* in our model arises as a trade-off between the Q genes that need to 
be active (for which a higher C is favored) and the M - Q genes that need to be inactive (for which 
a lower C is favored). Note, however, the asymmetry between the two crosstalk types: while the 
x_2 component (genes that should remain inactive) can be completely suppressed by having no 
TF (C = 0), the opposite does not hold. The x_1 component (genes that should be active) cannot 
be fully eliminated even for infinitely high C, because of the cross-activation between the distinct 
genes that should be active; see Fig. S7(a). This trade-off varies with the relative weights of x_1 and 
x_2, which depend on both Q and S. We find that a finite, nonzero concentration C* that minimizes 
crosstalk exists only in the third regime ("regulation regime"). In the first regime, where 
S > 1/(M - Q), binding sites are so similar that crosstalk due to the inactive M - Q genes dominates 
the total crosstalk. Hence the choice of C* = 0 completely eliminates x_2 crosstalk and minimizes 
the total crosstalk. 


Figure S7: How is the optimal TF concentration C* determined? (a) The x_1 crosstalk component (genes 
that should be active) decreases with TF concentration C, whereas the x_2 crosstalk component (genes 
that should remain inactive) shows the opposite trend. Curves of x_1 and x_2 (crosstalk of a single 
gene) vs. C are illustrated for various values of S. While x_2 can be fully eliminated if C = 0, x_1 
has a residual component which depends on S even for infinite C. Both crosstalk types increase 
with the similarity between the binding sites S (compare curves with various S values). (b) The 
optimal concentration C* is a decreasing function of the similarity S for all Q values. At fixed M, 
the optimal TF concentration, C*, diverges with the number of co-activated genes, Q. This leads to 
the "constitutive regime," where crosstalk is mathematically minimized by taking C = ∞. Shown 
is the optimal concentration C* as a function of the number of co-activated genes Q, for various 
S values; M is fixed at 5000. The value of Q at which C* diverges depends on S. For small Q, we 
require M - 1/S < Q, otherwise the optimal concentration is in the C* = 0 regime. For the lower 
S values crosstalk can be minimized for 0 < Q < Q_max < M, whereas for higher S values there 
also exists a value Q_min such that 0 < Q_min < Q < Q_max < M. In other words, higher S 
leads to a narrower range of Q where the crosstalk can be effectively minimized. 



Figure S8: Minimal crosstalk X* is an increasing function of the similarity S and has a 
non-monotonic dependence on the number of active genes Q. The balance between genes that need 
to be active (x_1 crosstalk type) and genes that need to remain inactive (x_2 crosstalk type) causes 
a non-monotonic dependence of the total crosstalk on the number of active genes Q, which has 
a maximum at an intermediate Q value. Curves are shown only in the regulation regime, where 
crosstalk is minimized by a finite TF concentration. The curves are truncated at the point of 
transition to regime II, where the TF concentration formally diverges to infinity. 


In the second regime, where a large number of genes Q need to be active, crosstalk due to the Q 
active genes dominates (x_1 type), hence C* diverges to infinity. Fig. S7(b) shows the 
optimal concentration C* as a function of the number of active genes Q for constant values of S. 
As Q increases, the relative weight of the genes that need to be active increases, hence C* is always 
a monotonically increasing function of Q. 

1.1.2 Dependence on the similarity S 

Both crosstalk types, x_1 and x_2, increase with the similarity S (see Fig. S7(a)). For a fixed Q, C* 
decreases as a function of S. Again, this is because for larger S the weight of the genes that should 
remain inactive is more significant, hence the trade-off shifts towards lower TF concentrations (but 
the minimal crosstalk X* still increases). This behavior applies only in the regulation regime, hence 
for M - 1/S < Q < Q_max. For larger values of Q (Q > Q_max), a more complex behavior is found 
because by changing S we pass through all three regimes: C* first decreases, then diverges 
(because it enters the second regime), and then decreases again. 

1.1.3 Dependence on the number of active genes Q 

The two crosstalk types show opposite dependence on the number of active genes Q: crosstalk per 
gene that needs to be active (x_1) decreases with Q, whereas crosstalk per gene that needs to remain 
inactive (x_2) increases with Q. The total crosstalk is a weighted sum of both with varying weights, hence 
it is not surprising that the total crosstalk has a non-monotonic dependence on the number of active 
genes Q, with a maximum at an intermediate value; see Fig. S8. The optimal TF concentration C* 
increases with the number of active genes Q; see Fig. S7(b). 


1.2 Basic model with regulation by repressors only 


Our basic model assumed that all gene regulation is achieved by using specific activators to drive 
the expression of genes that would otherwise remain inactive. An alternative formulation of the 
problem postulates that genes are strongly expressed without TFs bound to their regulatory sites, 
but need to be repressed by the binding of specific regulators to stop their expression. Indeed, many 
bacterial genes seem to be regulated in this way. We thus studied this complementary model, in 
which all regulators are repressors instead of activators. We assume, as before, that Q out of M 
genes should be active, but now this implies that M — Q types of cognate repressors are present for 
all the genes that should remain inactive. 

The expressions for crosstalk per gene that should be active (x_1) or inactive (x_2) read: 

\[
x_1 = \frac{CS}{e^{-E_a} + CS}, \tag{S16a}
\]
\[
x_2 = \frac{e^{-E_a} + CS}{\frac{C}{M - Q} + e^{-E_a} + CS}. \tag{S16b}
\]

The total crosstalk is still 

\[
X = \frac{Q}{M} x_1 + \frac{M - Q}{M} x_2. \tag{S17}
\]

Eqs. (S16) are mathematically identical to Eqs. (S10), with the roles of Q and M - Q (and of x_1 
and x_2) swapped. Not surprisingly, the minimal crosstalk in this case is: 

\[
x_1^* = \frac{(M - Q) S (1 - QS)}{QS + \sqrt{QS}}, \tag{S18a}
\]
\[
x_2^* = \sqrt{QS}, \tag{S18b}
\]
\[
X^* = \frac{M - Q}{M} \left( 2\sqrt{QS} - QS \right), \tag{S18c}
\]

which is valid for S < 1/Q. 

The optimal TF concentration that minimizes crosstalk is now 

\[
C^* = \frac{e^{-E_a} (M - Q)(1 - QS)}{\sqrt{QS} + QS(2 - QS) + MS(QS - 1)}. \tag{S19}
\]

The minimal crosstalk and optimal concentration are illustrated in Fig. S9. The model retains the 3 
regulatory regimes observed with activators only: 

1. For S > 1/Q we obtain the "no regulation" regime, where crosstalk is minimized by taking 
C* = 0. 

2. For Q < Q_min(S, M) we obtain the "constitutive regime," where crosstalk is minimized by 
taking C → ∞. Q_min is obtained when C* of Eq. (S19) diverges (the denominator equals 
zero). 

3. Otherwise, there is an optimal concentration 0 < C* < ∞, given by Eq. (S19), that minimizes 
crosstalk; this is the "regulation regime." 

The three regions are marked with Roman numerals, in accordance with Fig. 3 of the main text. 
The boundaries between the three regimes are now S* = 1/Q (between regimes I and III) and 
S* = (M - 3Q ± √((M - Q)(M - 5Q)))/(2Q(M - Q)) (between regime II and both I and III). 

The results are clearly a mirror image of the results shown in Fig. 3 of the main text for the 
activator-only basic model. They can be obtained simply by mapping Q → M - Q. Since we keep 
the convention that Q is the number of genes that are active, the difference in regulation strategies 
amounts to having either Q activator types and keeping M - Q binding sites unbound 
(activator-only) or having M - Q repressor types and keeping Q binding sites unbound. Comparing the 
expressions for minimal crosstalk, Eq. (S18c) to Eq. (S13), we conclude that crosstalk depends on 
the fraction of TFs that are expressed and on the absolute number of binding sites that need to remain 
unbound. 
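
The mirror symmetry between the activator-only and repressor-only models can be verified numerically. The sketch below is illustrative: it implements Eqs. (S10) and (S16) side by side, evaluates the repressor model at the optimum of Eq. (S19), and compares with Eq. (S18c). Function names and parameter values are assumptions made for this example.

```python
import math

def crosstalk_activators(C, Q, M, S, Ea):
    """Total crosstalk in the activator-only model, Eqs. (S8) and (S10)."""
    a = math.exp(-Ea)
    x1 = (a + C * S) / (C / Q + a + C * S)
    x2 = C * S / (a + C * S)
    return (Q / M) * x1 + ((M - Q) / M) * x2

def crosstalk_repressors(C, Q, M, S, Ea):
    """Total crosstalk in the repressor-only model, Eqs. (S16) and (S17)."""
    a = math.exp(-Ea)
    x1 = C * S / (a + C * S)                          # active genes must stay unbound
    x2 = (a + C * S) / (C / (M - Q) + a + C * S)      # inactive genes need their repressor
    return (Q / M) * x1 + ((M - Q) / M) * x2

def optimal_concentration_repressors(Q, M, S, Ea):
    """Closed-form C* of Eq. (S19)."""
    q = Q * S
    return (math.exp(-Ea) * (M - Q) * (1.0 - q)
            / (math.sqrt(q) + q * (2.0 - q) + M * S * (q - 1.0)))

# Illustrative parameter values (assumptions, not taken from the text).
M, Q, S, Ea = 5000, 1000, 1e-5, 15.0

C_star = optimal_concentration_repressors(Q, M, S, Ea)
X_star_numeric = crosstalk_repressors(C_star, Q, M, S, Ea)
X_star_formula = ((M - Q) / M) * (2.0 * math.sqrt(Q * S) - Q * S)   # Eq. (S18c)
print(C_star, X_star_numeric, X_star_formula)
```

Evaluating `crosstalk_repressors(C, Q, ...)` and `crosstalk_activators(C, M - Q, ...)` at the same C should give the same number, which is the Q → M - Q mapping in concrete form.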


1.3 Breaking the symmetry between the two crosstalk types 

In our basic model we made the simplifying assumption that the two crosstalk types, x_1 and x_2, have 
equal weights: not activating a gene that should be active and erroneously activating a gene that 
should be silenced are assumed to be equally disadvantageous. We now relax this symmetry by 
allowing different weights, a and b, for the two crosstalk types, to model possible differences in 
their biological significance. Eq. (S8) for the total crosstalk now takes the form: 

\[
X = a \frac{Q}{M} x_1 + b \frac{M - Q}{M} x_2. \tag{S20}
\]

Figure S9: Crosstalk in the basic model with regulation by repressors alone is a mirror image 
of regulation with activators only. Panel (a) shows the minimal crosstalk, X*; panel (b) shows 
the optimal TF concentration, C*. These results are analogous to Fig. 3 of the main paper, which 
is computed for regulation with activators only. The observed picture is an exact mirror image of 
Fig. 3 of the main text, namely Q maps to M - Q, where we keep the convention that Q denotes 
the number of genes that should be active. The difference is that in the activator model activating Q 
genes requires Q types of activators, whereas in the repressor model this requires M - Q types of 
repressors. 


The expression for the optimal TF concentration then reads: 

\[
C^* = \frac{e^{-E_a} Q \left( \pm\sqrt{abS(M - Q)} - S \left( aQ - b(M - Q)(1 + SQ) \right) \right)}{S \left( aSQ^2 - b(M - Q)(1 + SQ)^2 \right)}, \tag{S21}
\]

where again only one of the two solutions yields non-negative concentration values. The resulting 
minimal crosstalk is: 

\[
X^*(a, b) = \frac{Q}{M} \left( -Sb(M - Q) + 2\sqrt{abS(M - Q)} \right). \tag{S22}
\]

Setting a = b = 1 reduces the above formulas to the previous solution, Eqs. (S11)-(S13). Note 
the asymmetry between the two crosstalk types: if b = 0, i.e., when crosstalk in genes that should 
remain inactive is insignificant, the minimal achievable crosstalk equals zero. This is not true in the 
other extreme case, when a = 0. In Fig. S10 we show that the three different regulatory regimes still 
exist under this generalized definition of crosstalk, but their boundaries may shift. 
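
A short numerical check of the weighted solution may be helpful. The sketch below is an illustration only: it evaluates the weighted crosstalk of Eq. (S20) at the optimum of Eq. (S21), taking the minus sign of the ± (the branch that gives a positive C* for the hypothetical parameters chosen here), and compares the result with Eq. (S22). Function names and parameter values are assumptions.

```python
import math

def weighted_crosstalk(C, Q, M, S, Ea, a, b):
    """Weighted total crosstalk, Eq. (S20), with x_1, x_2 from Eqs. (S10)."""
    e = math.exp(-Ea)
    x1 = (e + C * S) / (C / Q + e + C * S)
    x2 = C * S / (e + C * S)
    return a * (Q / M) * x1 + b * ((M - Q) / M) * x2

def weighted_optimal_concentration(Q, M, S, Ea, a, b):
    """Eq. (S21) with the minus sign chosen for the square-root term."""
    root = math.sqrt(a * b * S * (M - Q))
    numerator = math.exp(-Ea) * Q * (-root - S * (a * Q - b * (M - Q) * (1 + S * Q)))
    denominator = S * (a * S * Q ** 2 - b * (M - Q) * (1 + S * Q) ** 2)
    return numerator / denominator

def weighted_minimal_crosstalk(Q, M, S, a, b):
    """Closed-form minimal crosstalk, Eq. (S22)."""
    return (Q / M) * (-S * b * (M - Q) + 2.0 * math.sqrt(a * b * S * (M - Q)))

# Illustrative parameter values (assumptions, not taken from the text).
M, Q, S, Ea = 5000, 2500, 1e-5, 15.0
a, b = 1.0, 0.5

C_star = weighted_optimal_concentration(Q, M, S, Ea, a, b)
X_numeric = weighted_crosstalk(C_star, Q, M, S, Ea, a, b)
X_formula = weighted_minimal_crosstalk(Q, M, S, a, b)
print(C_star, X_numeric, X_formula)
```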


1.4 Breaking the symmetry between the co-activated genes 

In our basic model we imposed full symmetry between the Q co-activated genes: they contributed 
equally to crosstalk and all Q types of TFs were assumed to exist in equal concentrations. We now 
relax these assumptions. We examine the situation in which a fraction h of these Q genes is more 
important to the functioning of the cell. Mathematically, we postulate that the per-gene crosstalk 
error for the important genes contributes with a γ-times higher weight to the total crosstalk 
relative to the non-important genes. We introduce an additional degree of freedom to the model by 
allowing the concentration of the TFs to split unevenly between important and other genes: each 
important gene has its TF present at concentration C_0, while the TF of a non-important gene is 
present at concentration C_1, with C_0 = ηC_1. 

Figure S10: The three different regulatory regimes robustly exist even if the relative weights of 
the two crosstalk types vary. To break the symmetry between the two error types we consider a 
redefined crosstalk, X(b) = (Q/M) x_1 + b ((M - Q)/M) x_2 (in the basic model b = 1). For different 
values of b (the cost of mis-activating genes that should remain inactive), all three regulatory 
regimes are preserved, although their boundaries shift. The weight of the first crosstalk type 
(mis-regulating genes that should be active) is equal in all cases. Red shows the "regulation regime" 
(0 < C* < ∞). As erroneous activation is penalized less (decreasing b), the "no regulation" 
(C* = 0, white) regime shrinks, whereas the constitutive expression regime (C* = ∞, black) expands, 
as expected. 


As hQC_0 + (1 - h)QC_1 = C, we obtain: 

\[
C_0 = \frac{\eta C}{Q(1 - h + h\eta)}, \tag{S23a}
\]
\[
C_1 = \frac{C}{Q(1 - h + h\eta)}. \tag{S23b}
\]

If either h = 0 or η = 1 this reduces back to the basic model with C_0 = C_1 = C/Q. The total 
crosstalk now takes the form: 

\[
X = \frac{Q}{M} \left[ \gamma h x_0 + (1 - h) x_1 \right] + \frac{M - Q}{M} x_2, \tag{S24a}
\]
\[
x_0 = \frac{e^{-E_a} + CS}{C_0 + e^{-E_a} + CS}, \tag{S24b}
\]
\[
x_1 = \frac{e^{-E_a} + CS}{C_1 + e^{-E_a} + CS}, \tag{S24c}
\]
\[
x_2 = \frac{CS}{e^{-E_a} + CS}, \tag{S24d}
\]

where x_0 is the per-gene error of the important genes, x_1 is the error of other genes that need to be 
activated, and x_2, as before, denotes crosstalk at genes that need to be kept inactive. 

We can optimize numerically for both the total TF concentration C and the factor η by which the 
TF concentration of the important genes is amplified. Alternatively, we can assume that C remains 
fixed at the optimal value for the case where all genes are equally important, and only optimize for 
η. We display the latter option in Fig. S11 to explore crosstalk at varying h under equal resource 
constraints. 

Figure S11: Crosstalk can be reduced for a subset of important genes at the cost of increasing 
the total crosstalk. To break the symmetry between genes, we define a fraction h (out of Q) genes 
as important, having a γ-times higher contribution to the total crosstalk. TF concentration for these 
genes is optimized separately, subject to the total TF concentration C remaining fixed to its optimal 
value in the symmetric, γ = 1, case. We show the crosstalk per important gene, x_0 (red), and per 
normal gene, x_1 (black), as a function of γ (for h = 0.1). The inset shows the same as a function 
of h (for γ = 10). Per-gene crosstalk increases approximately linearly with h, and important genes 
achieve ≈ √γ smaller crosstalk relative to normal genes. 


The special case when only a single gene is important is analytically solvable assuming Q ≫ 1, 
yielding: 

\[
X^*_{\text{1 important gene}} = \frac{-SQ(M - Q) + 2\sqrt{S(M - Q)} \left( Q - 1 + \sqrt{\gamma} \right)}{M}. \tag{S25}
\]

In particular, the per-gene errors read: 

\[
x_0^* = \frac{\sqrt{S(M - Q)}}{\sqrt{\gamma}}, \tag{S26a}
\]
\[
x_1^* = \sqrt{S(M - Q)}, \tag{S26b}
\]
\[
x_2^* = \frac{-SQ(M - Q) + \sqrt{S(M - Q)} \left( Q - 1 + \sqrt{\gamma} \right)}{M - Q}. \tag{S26c}
\]

The error of the single important gene can be reduced at most by a factor of √γ relative to the 
other co-activated genes. The x_1 error for the other Q - 1 genes remains the same, because we 
assumed that Q ≫ 1. Interestingly, the M - Q genes that need to be kept inactive suffer an increase 
in crosstalk as a consequence of protecting the important gene. 


1.5 Every transcription factor regulates Θ genes 

In the basic model we considered a regulatory scheme in which every gene has its own unique TF 
type. This allows for maximal flexibility in regulating each gene individually. Real gene regulatory 
networks typically have fewer TFs than the number of target genes, so that at least some 
transcription factors regulate several genes. Here we consider a simple extension of the basic model, 
in which each TF regulates Θ genes (with identical binding sites) rather than one. We assume no 
overlap between the sets of genes regulated by various TFs, so that the total number of TF species 
is now Θ times smaller than before. If Q genes should be active, then Q/Θ TF species should be 
present in a given condition. Assuming that Q/Θ ≫ 1, we can approximate Q/Θ - 1 ≈ Q/Θ as 
before. The only change from the basic crosstalk formulation is in x_1, because the concentration of 
cognate factors is now Θ times larger than before: 

\[
x_1 = \frac{e^{-E_a} + CS}{\Theta C/Q + e^{-E_a} + CS}, \tag{S27a}
\]
\[
x_2 = \frac{CS}{e^{-E_a} + CS}. \tag{S27b}
\]

This formulation is analytically solvable, yielding 

\[
x_1^* = \sqrt{\frac{S(M - Q)}{\Theta}}, \tag{S28a}
\]
\[
x_2^* = \frac{SQ}{\Theta} \left( \frac{\sqrt{\Theta}}{\sqrt{S(M - Q)}} - 1 \right), \tag{S28b}
\]
\[
X^* = \frac{Q}{M} \left( -\frac{S(M - Q)}{\Theta} + 2\sqrt{\frac{S(M - Q)}{\Theta}} \right), \tag{S28c}
\]
\[
C^* = \frac{e^{-E_a} Q \left( \Theta - S(M - Q) \right)}{S^2 (M - Q) Q + S(M - 2Q)\Theta + \sqrt{S(M - Q)}\, \Theta^{3/2}}. \tag{S28d}
\]

The equations for minimal crosstalk are equivalent to the basic model if we map S → S/Θ. 
Since crosstalk depends on √S to first order, this amounts to crosstalk reduction by a factor of √Θ. 
For small S the leading term in the optimal concentration is 

\[
C^* = \frac{e^{-E_a} Q}{\sqrt{\Theta} \sqrt{S(M - Q)}} + O(1). \tag{S29}
\]

These gains in crosstalk have, however, been achieved by sacrificing the ability to regulate each 
gene individually: now, the smallest set of genes that can be co-activated is of size Θ. Typically, TFs 
might constitute > 10% of the genes [47]: with Θ ~ 10, the crosstalk could be reduced by a factor 
of ~ 3 at best. 
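
The size of the reduction for Θ ~ 10 can be illustrated numerically. The sketch below assumes that the per-gene errors take the form of the basic model with the cognate concentration scaled by Θ, and that the optimum follows from the mapping S → S/Θ stated above. Function names and parameter values are hypothetical.

```python
import math

def crosstalk_shared_tf(C, Q, M, S, Ea, theta):
    """Total crosstalk when each TF regulates theta genes (cognate term scaled by theta)."""
    e = math.exp(-Ea)
    x1 = (e + C * S) / (theta * C / Q + e + C * S)
    x2 = C * S / (e + C * S)
    return (Q / M) * x1 + ((M - Q) / M) * x2

def optimal_concentration_shared_tf(Q, M, S, Ea, theta):
    """Optimal concentration for the shared-TF model (closed form)."""
    numerator = math.exp(-Ea) * Q * (theta - S * (M - Q))
    denominator = (S ** 2 * (M - Q) * Q + S * (M - 2 * Q) * theta
                   + math.sqrt(S * (M - Q)) * theta ** 1.5)
    return numerator / denominator

def minimal_crosstalk_shared_tf(Q, M, S, theta):
    """Minimal crosstalk obtained from the basic model by mapping S -> S/theta."""
    r = math.sqrt(S * (M - Q) / theta)
    return (Q / M) * (2.0 * r - r * r)

# Illustrative parameter values (assumptions, not taken from the text).
M, Q, S, Ea, theta = 5000, 1000, 1e-6, 15.0, 10

C_star = optimal_concentration_shared_tf(Q, M, S, Ea, theta)
X_numeric = crosstalk_shared_tf(C_star, Q, M, S, Ea, theta)
X_formula = minimal_crosstalk_shared_tf(Q, M, S, theta)
reduction = minimal_crosstalk_shared_tf(Q, M, S, 1) / X_formula
print(X_numeric, X_formula, reduction)
```

For these values the reduction factor comes out slightly below √10, consistent with the "factor of ~ 3 at best" estimate.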


1.5.1 Non-constant Θ 

Until now, we assumed that each TF regulates exactly Θ genes. This assumption can be relaxed 
using numerical simulations; in particular, we considered the case where the number of genes that 
each TF regulates is a random variable drawn from a specified distribution. We started by defining 
which TF controls which sets of genes through explicit enumeration of binding site sequences. We 
assumed that the number of genes that a given TF regulates is approximately Poisson distributed 
(with mean Θ) and that all these regulated genes use the same sequence for their binding site, 
equal to the consensus sequence of the cognate TF. We then sampled the environments in which 
Q out of the total of M genes are active; given the regulatory network structure, not all Q picks 
out of M can be realized, as is also the case with the constant-Θ model. The crosstalk is evaluated in 
each environment exactly, by computing all thermodynamic states of all binding sites, and is 
subsequently averaged by Monte Carlo sampling through the possible environments. This extension 
to the model introduces no new parameters, so its crosstalk and regime boundaries can be 
straightforwardly compared to the model where Θ is constant. We find that Poisson-distributed Θ changes 
crosstalk at a below-percent level, and produces no notable shifts in regime boundaries, showing 
that our results are robust with respect to this particular distributional assumption. 

2 Validity of the mean-field assumption 

In computing crosstalk at given M and Q, we have made a mean-field assumption on the similarity 
measure S. For a given set of binding site sequences in the sequence space (total M in number), 
this amounts to assuming that the distribution of neighbours for each binding site comes from the 
same underlying distribution. For a particular selection of Q genes, for each binding site i from the 
M binding sites, similarity S_i can be defined using d_ij, where j ≠ i indexes over the binding sites 
of the Q selected genes: 

\[
S_i = \frac{1}{Q} \sum_{j \in Q,\, j \neq i} e^{-\epsilon d_{ij}}. \tag{S30}
\]

From this, we have for crosstalk for a particular selection of Q genes, 

\[
X(\{S_i\}) = \frac{1}{M} \left[ \sum_{i \in Q} x_1(S_i) + \sum_{i \in M - Q} x_2(S_i) \right], \tag{S31}
\]

where x_1(S_i) and x_2(S_i) depend on S_i as shown. We are interested in the mean crosstalk X = 
⟨X({S_i})⟩ over all selections of Q out of M genes, which requires us to know the full distribution 
of S_i. The crosstalk is then 

\[
X = \langle X(\{S_i\}) \rangle = \frac{1}{M} \left[ \sum_{i \in Q} \langle x_1(S_i) \rangle + \sum_{i \in M - Q} \langle x_2(S_i) \rangle \right]. \tag{S32}
\]

In the mean-field assumption, we have ⟨x_1(S_i)⟩ ≈ x_1(⟨S_i⟩) = x_1(S) and ⟨x_2(S_i)⟩ ≈ x_2(⟨S_i⟩) = 
x_2(S), which gives us 

\[
X = \frac{Q}{M} x_1(S) + \frac{M - Q}{M} x_2(S). \tag{S33}
\]

From this, one can obtain the optimal crosstalk X*. To check the validity of such a mean-field 
assumption, we performed numerical simulations by drawing lists of M binding sites from the 
sequence space, computing optimal crosstalk by explicit enumeration of all thermodynamic 
states, and comparing this with the mean-field crosstalk X*. In detail, we first picked M binding 
sites (to regulate M genes) randomly from the sequence space and held this choice fixed. Then, 
for each Q, we performed n_sel different selections of Q out of M genes. For each such selection, 
after computing the binding site mismatches and occupancies, we computed the crosstalk. To get 
the mean crosstalk for Q, we performed a Monte Carlo estimate of the mean crosstalk over these n_sel 
different selections of Q out of M genes. Figures S12 and S13 show that the mean-field crosstalk 
systematically over-estimates the actual crosstalk, but nevertheless remains a very good 
approximation to the true crosstalk. 
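
A scaled-down version of this kind of check can be sketched in a few lines. The example below is only an illustration and not the simulation described above: it draws random binding sites, computes a per-site similarity S_i for one random selection of Q genes, and compares its average to the mean-field value (1/4 + (3/4) e^{-ε})^L. The normalization of S_i by Q, the function names, the random seed, and all parameter values are assumptions made for this example.

```python
import math
import random

def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    return sum(1 for u, v in zip(a, b) if u != v)

def empirical_similarities(sites, selected, eps):
    """Assumed S_i = (1/Q) * sum over selected j != i of exp(-eps * d_ij), for every site i."""
    Q = len(selected)
    similarities = []
    for i, site in enumerate(sites):
        total = 0.0
        for j in selected:
            if j != i:
                total += math.exp(-eps * hamming(site, sites[j]))
        similarities.append(total / Q)
    return similarities

# Small illustrative parameters (assumptions; far smaller than M = 5000 used in the figures).
rng = random.Random(0)
M, Q, L, eps = 300, 100, 6, 1.0

sites = [tuple(rng.randrange(4) for _ in range(L)) for _ in range(M)]
selected = rng.sample(range(M), Q)

S_i = empirical_similarities(sites, selected, eps)
S_empirical = sum(S_i) / M
S_mean_field = (0.25 + 0.75 * math.exp(-eps)) ** L
print(S_empirical, S_mean_field)
```

The spread of the individual S_i values around their mean indicates how much gene-to-gene variability the mean-field treatment ignores.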

3 Mixed models 

In the baseline model we consider M genes, all of which are regulated either solely by activators 
or solely by repressors. Flere, we consider mixed models, i.e., models that utilize repression to 
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Figure S12: Comparison of mean-field results and numerical simulations. On the left, we plot 
the difference in optimal crosstalk between simulations and the mean-field approach, — X*, 
for different Q and S. On the right, we plot — X* against Q for three different S. Here, 

M = 5000, L = 10, and S has been varied by tuning e. Xis a Monte Carlo estimate of the mean 
crosstalk, obtained over Ugei different selections of Q out of M genes, risei = 1 in the top row, and 
risei = 30 in the bottom row. The mean-field approach is in general a very good approximation of 
the simulations. The maximal crosstalk difference is less than 0.02, and decreases with increasing 
S. 


control one subset of genes and activation to control the other genes. Let's assume that M_a genes
are regulated by activators and M_r genes are regulated by repressors, where M = M_a + M_r.
In a particular environment, let's assume that Q genes need to be ON. Out of these, let's assume
that Q_a genes are activator-regulated and Q_r genes are repressor-regulated, where Q = Q_a +
Q_r. For activating Q genes, the number of TFs present now amounts to T = Q_a + M_r − Q_r:
Q_a activators and M_r − Q_r repressors. As before, S is the similarity of the binding sites and C
the total concentration of TFs (activators + repressors). The concentration of a particular TF type,
when present, will now be C/T. We assume that any non-cognate interaction ("activation
out-of-context" or "repression out-of-context") counts as a crosstalk error. We distinguish 4 types of
per-gene crosstalk errors:

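An activator-regulated gene that needs to be ON should be bound by the cognate activator.
The unbound state and any non-cognate binding (non-cognate activator or repressor) are crosstalk
states:

$$x_1^A = \frac{e^{-E_a} + CS}{C/T + e^{-E_a} + CS} \qquad (Q_a \text{ out of } M \text{ genes}). \qquad \text{(S34)}$$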

Figure S13: Comparison of mean-field results and numerical simulations. On the left, we plot the
difference in optimal crosstalk between simulations and the mean-field approach for
different Q and S. On the right, we plot the same difference against Q for three different S. Here, M = 500,
L = 8, and S has been varied by tuning ε. The simulated crosstalk is a Monte Carlo estimate of the mean
crosstalk, obtained over n_sel = 100 different selections of Q out of M genes. Again, as with M = 5000,
the mean-field approach is a very good approximation of the simulations. The maximal crosstalk
difference is only slightly larger than 0.02.

An activator-regulated gene that needs to be OFF should be unbound. Any non-cognate binding
is a crosstalk state:

$$x_2^A = \frac{CS}{e^{-E_a} + CS} \qquad (M_a - Q_a \text{ out of } M \text{ genes}). \qquad \text{(S35)}$$

A repressor-regulated gene that needs to be ON should be unbound. Any non-cognate binding
is a crosstalk state:

$$x_1^R = \frac{CS}{e^{-E_a} + CS} \qquad (Q_r \text{ out of } M \text{ genes}). \qquad \text{(S36)}$$

Lastly, a repressor-regulated gene that needs to be OFF should be bound by the cognate repressor.
The unbound state and any non-cognate binding (non-cognate repressor or activator) are
crosstalk states:

$$x_2^R = \frac{e^{-E_a} + CS}{C/T + e^{-E_a} + CS} \qquad (M_r - Q_r \text{ out of } M \text{ genes}). \qquad \text{(S37)}$$
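
As $x_1^R = x_2^A$ and $x_2^R = x_1^A$, the overall crosstalk error reads

$$X_{\rm mixed} = \frac{Q_a}{M} x_1^A + \frac{M_a - Q_a}{M} x_2^A + \frac{Q_r}{M} x_1^R + \frac{M_r - Q_r}{M} x_2^R = \frac{T}{M} x_1^A + \frac{M - T}{M} x_2^A = X(Q_{\rm eff} = T, M_{\rm eff} = M). \qquad \text{(S38)}$$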


Hence, for a given set (Q_a, Q_r, M_a, M_r) of the mixed model, the crosstalk is the same as that in an
equivalent baseline activator model with Q_eff = T = M_r + Q_a − Q_r and M_eff = M = M_a + M_r.
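
The equivalence between the mixed model and a baseline model with Q_eff = T can be checked numerically. The following is a minimal sketch, assuming per-gene errors of the baseline form (unbound weight e^{-E_a}, cognate weight C/T, total noncognate weight CS); the function names and the parameter values in the usage note are illustrative and not taken from the original analysis.

```python
import math

def x_should_be_bound(C, S, n_tf, E_a):
    """Error probability for a gene that should be bound by its cognate TF.

    Assumes n_tf TF types share the total concentration C equally.
    """
    unbound = math.exp(-E_a)
    noncognate = C * S
    return (unbound + noncognate) / (C / n_tf + unbound + noncognate)

def x_should_be_unbound(C, S, E_a):
    """Error probability for a gene that should stay unbound."""
    noncognate = C * S
    return noncognate / (math.exp(-E_a) + noncognate)

def baseline_crosstalk(C, S, Q, M, E_a):
    """Baseline (all-activator) crosstalk with Q of M genes required ON."""
    return (Q * x_should_be_bound(C, S, Q, E_a)
            + (M - Q) * x_should_be_unbound(C, S, E_a)) / M

def mixed_crosstalk(C, S, Q_a, Q_r, M_a, M_r, E_a):
    """Crosstalk of the mixed model, summing the four per-gene error types."""
    T = Q_a + M_r - Q_r          # number of TF types present
    M = M_a + M_r
    bound_err = x_should_be_bound(C, S, T, E_a)
    unbound_err = x_should_be_unbound(C, S, E_a)
    total = (Q_a * bound_err              # activator genes that must be ON
             + (M_a - Q_a) * unbound_err  # activator genes that must be OFF
             + Q_r * unbound_err          # repressor genes that must be ON
             + (M_r - Q_r) * bound_err)   # repressor genes that must be OFF
    return total / M
```

For example, `mixed_crosstalk(1000.0, 1e-5, 30, 20, 100, 80, 5.0)` should agree with `baseline_crosstalk(1000.0, 1e-5, 90, 180, 5.0)`, since T = 30 + 80 − 20 = 90 and M = 180.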

For a given M, different (M_a, M_r) partitions are possible, which differ in the number of genes
under activator or repressor control. This can be tuned on an evolutionary timescale. Once M_a is
chosen, different selections of Q genes that should be active potentially have different numbers of
genes under the control of activators (Q_a) and repressors (Q_r = Q − Q_a). However, the optimal
TF concentration C* and the minimal crosstalk X* only depend on the total number of TFs T.

For given M, Q, and S, we find the best possible M_a, which minimizes the crosstalk. For a particular
M_a, we define the optimal crosstalk as the average optimal mixed crosstalk over all selections
of Q genes out of M (averaged over different choices of Q_a),

$$X^*(M, Q, S, M_a) = \sum_{Q_a} P_{Q_a} X^*_{\rm mixed}(Q_a, M, Q, S, M_a), \qquad \text{(S39)}$$

where $P_{Q_a}$ is the fraction of Q-gene selections that contain Q_a activator-regulated genes. We have

$$P_{Q_a} = \frac{\binom{M_a}{Q_a}\binom{M - M_a}{Q - Q_a}}{\binom{M}{Q}}, \qquad \text{(S40)}$$

$$X^*(M, Q, S) = \min_{M_a} X^*(M, Q, S, M_a), \qquad \text{(S41)}$$

$$M_a^* = \arg\min_{M_a} X^*(M, Q, S, M_a), \qquad \text{(S42)}$$

where M_a^* is the M_a value which minimizes crosstalk for a given Q. In Fig. S14 we see that for
Q < M/2, the best strategy is to use all activators (M_a = M), and for Q > M/2, the best strategy is
to use all repressors; optimization of crosstalk in mixed models therefore always picks out one of
the two "pure" regulatory strategies and does not yield an optimal mixed model.
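
The averaging over gene selections can be sketched as follows. The weights are the hypergeometric fractions of Q-gene selections that contain Q_a activator-regulated genes. As an assumption, the per-selection optimum is taken to be the leading-order single-site expression X* = (Q_eff/M)(2√(S(M − Q_eff)) − S(M − Q_eff)) evaluated at Q_eff = T; this is only meaningful in the regulation regime, and all function names are hypothetical.

```python
import math

def selection_weights(M, Q, M_a):
    """Hypergeometric weights P_{Q_a} over the feasible values of Q_a."""
    lo = max(0, Q - (M - M_a))   # cannot pick more repressor genes than exist
    hi = min(Q, M_a)             # cannot pick more activator genes than exist
    total = math.comb(M, Q)
    return {
        Q_a: math.comb(M_a, Q_a) * math.comb(M - M_a, Q - Q_a) / total
        for Q_a in range(lo, hi + 1)
    }

def x_star_baseline(Q_eff, M, S):
    """Assumed leading-order minimal crosstalk of the baseline model."""
    sigma = S * (M - Q_eff)
    return Q_eff / M * (2.0 * math.sqrt(sigma) - sigma)

def averaged_mixed_crosstalk(M, Q, S, M_a):
    """Average optimal crosstalk over all selections of Q genes, at fixed M_a."""
    M_r = M - M_a
    total = 0.0
    for Q_a, weight in selection_weights(M, Q, M_a).items():
        T = Q_a + M_r - (Q - Q_a)    # TF types present for this selection
        total += weight * x_star_baseline(T, M, S)
    return total
```

A scan such as `min(range(M + 1), key=lambda m_a: averaged_mixed_crosstalk(M, Q, S, m_a))` then gives a candidate for the best M_a under these assumptions.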

To see if the pure strategies get chosen because the activation of all genes is symmetric in all
environments, we studied a simple system in which different subsets of genes are required to be
activated with different probabilities. So far, when Q genes are required to be ON, each gene had
the same probability, Q/M, to be among the Q out of M required genes, i.e., Q/M is the probability
of each gene to be activated.

Here, we introduce two classes (1 and 2) of genes, with M_1 genes in the first class and M_2 =
M − M_1 genes in the second class. Genes in each of the two classes have different probabilities
of requiring activation across environments: P_1 for the first class and P_2 for the second class. If
P_i > 0.5, then genes in class i are called "hot" genes, and if P_i < 0.5, genes in class i are called

Figure S14: Mixed model at best M_a^*. On the left, we plot the optimal number of activator-regulated
genes M_a^* for different Q at M = 500 and log(S) = −10.5. For Q < 250, it is best to have all genes under
activator control (M_a^* = 500) and for Q > 250, it is best to have all genes under repressor control
(M_a^* = 0). On the right, we plot the optimal mixed crosstalk, computed at M_a^*, and averaged over
different gene selections using P_{Q_a}.


"cold" genes. Given certain M_1, M_2, P_1, and P_2, different environments correspond to different
choices of the Q genes that should be active, where Q is no longer constant as before, but a random
variable with mean

$$\langle Q \rangle = P_1 M_1 + P_2 M_2.$$

In a similar fashion as before, we compute the crosstalk (at optimal C*) for different choices of
mixed models (how many class i genes are under activators or repressors). Then, we obtain the
optimal (M_a, M_r) strategy among these mixed models that minimizes crosstalk. In Fig. S15 we
show how this optimal strategy varies, along with ⟨Q⟩, as a function of P_1 and P_2 for a fixed choice
of M_1 = M_2 = 2500. First, we note that ⟨Q⟩ increases in any direction that increases P_1 or P_2. In the
symmetric mixed model setup, we essentially studied the system along the diagonal from (0,0) to
(1,1) on the (P_1, P_2) plane (dashed white line), increasing ⟨Q⟩ from 0 to M. The previously studied
results yielded two "pure" strategies (all activators or all repressors, depending on whether Q is
smaller or larger than M/2), which is consistent with the following observations in the asymmetric
mixed models. When P_1 < 0.5 and P_2 < 0.5 (all genes are cold), the optimal strategy is a pure
one, namely, to put all genes under activators; when P_1 > 0.5 and P_2 > 0.5 (all genes are hot),
the optimal strategy is to put all genes under repressors, which is also a pure strategy. But when
P_1 > 0.5, P_2 < 0.5 or P_1 < 0.5, P_2 > 0.5 (one class is hot, while the other is cold), the optimal
strategy is "mixed": put hot genes under repressors and cold genes under activators. Note that
not all ⟨Q⟩ are possible with these optimal mixed strategies. From here onwards, we study mixed
models in the bottom right square of Fig. S15, where P_1 > 0.5 and P_2 < 0.5, i.e., class 1 is hot and
class 2 is cold.
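
One way to see why hot genes favor repressors is to count the TF types that must be present on average. Taking T = Q_a + M_r − Q_r from the mixed model, an activator-controlled class contributes P_i M_i TF types on average, and a repressor-controlled class contributes (1 − P_i) M_i. The sketch below computes this mean; the helper names are hypothetical, and the link to crosstalk assumes the regime in which crosstalk grows with the number of TF types present.

```python
def expected_tf_types(class_sizes, on_probs, regulators):
    """Mean number of TF types present, given the regulator type of each class."""
    total = 0.0
    for size, p_on, regulator in zip(class_sizes, on_probs, regulators):
        if regulator == "activator":
            total += size * p_on           # activator present when gene must be ON
        elif regulator == "repressor":
            total += size * (1.0 - p_on)   # repressor present when gene must be OFF
        else:
            raise ValueError(f"unknown regulator type: {regulator!r}")
    return total

def hot_cold_strategy(on_probs):
    """Hot classes (P > 0.5) under repressors, cold classes under activators."""
    return ["repressor" if p > 0.5 else "activator" for p in on_probs]
```

With M_1 = M_2 = 2500, P_1 = 0.75 and P_2 = 0.25, the all-activator strategy requires ⟨Q⟩ = 2500 TF types on average, whereas the hot/cold assignment requires 1250.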

At fixed P_1 and P_2, crosstalk gains from using the optimal mixed strategy (instead of using all
activators) increase with both S and the number of hot genes M_1, as shown in Fig. S16.

In Fig. S17 we show in detail the crosstalk gains from using the optimal mixed strategy instead
of the optimal pure strategy (either all activators or all repressors), for different ⟨Q⟩ and S, for four

Figure S15: When some genes are hot and other genes are cold, the optimal mixed strategy puts
hot genes under repressors and cold genes under activators. Here we show how the optimal
strategy and ⟨Q⟩ vary as a function of P_1 and P_2 for a fixed choice of M_1 = M_2 = 2500. ⟨Q⟩
increases in any direction that increases P_1 or P_2. When P_1 < 0.5 and P_2 < 0.5 (all genes are cold),
the optimal strategy is a pure one (all genes under activator control), while when P_1 > 0.5 and
P_2 > 0.5 (all genes are hot), the optimal strategy is to put all genes under repressors, which is also a
pure strategy. But when P_1 > 0.5, P_2 < 0.5 or P_1 < 0.5, P_2 > 0.5 (one class is hot, while the other is
cold), the optimal strategy is "mixed": hot genes are under repressor control and cold genes under
activator control.

Figure S16: Crosstalk gains from using the optimal mixed strategy instead of all activators. Plotted
is the difference in optimal crosstalk (crosstalk gain) between the pure strategy
of using all activators and the optimal mixed strategy of putting hot genes under repressors and
cold genes under activators, as a function of S, with fixed P_1 = 0.75 and P_2 = 0.25. As S increases,
we cross from regulatory regime III to regime I, in which C* = 0. The optimal mixed strategy
becomes increasingly better (than the all-activators pure strategy at reducing crosstalk) as S and
M_1 increase.












different M_1 = 500, 2000, 3000 and 4500.


4 Alternative crosstalk definition 

In the basic setup presented in the main text, we considered "activation out-of-context" (i.e.,
activation by the binding of a noncognate TF when the cognate TF is present but not bound) to
be a crosstalk state. Our reasoning was motivated by viewing transcriptional regulation as a signal
transmission apparatus. In this interpretation, gene activation by a noncognate TF amounts to
generating a response (transcriptional activity) to a wrong input signal. Consequently, this should
count as crosstalk, despite the fact that (by chance) the correct signal was simultaneously present in
the cell. This is perhaps easiest to appreciate if one considers more realistic setups in which genes
are not simply "ON" and "OFF", but can be quantitatively regulated by the level of their cognate
TF. In such a model, there might be two TFs present and varying in concentration as a function of
time: one cognate for the gene of interest and one not. In this case it is clear that the correct response
of the gene is to track the changes in the cognate TF, and not to simply be expressed in a constant
"ON" state; consequently, tracking the noncognate TF due to crosstalk is obviously an error, even
if the cognate TF is present at the same time.

One could, however, argue that "activation-out-of-context" shouldn't be considered as an error 
state. If the presence or absence of TF signals is a binary variable and if the binary response is 
defined solely by the state of transcriptional activity (activation/inactivation of gene), then when 
the presence of the signal matches the response state, the regulation outcome is correct, irrespective 
of the molecular details on the promoter. For example, for a gene whose cognate TF is present, 
activation by any means (either by cognate or noncognate binding) is the correct response. In this 
scenario, the "out-of-context activation" is actually what one might call beneficial crosstalk: here, 
noncognate TF can be seen as helping to activate the gene when the cognate TF is also present. For 
a gene whose cognate TF is absent, activation is still an incorrect response, like before. 

Hence, x_2(i) retains the same expression, but x_1(i) changes to

$$x_1(i) = \frac{e^{-E_a}}{C_i + e^{-E_a} + \sum_{j \neq i} C_j e^{-\epsilon d_{ij}}}. \qquad \text{(S43)}$$

As shown in Fig. S18, optimizing C results in three distinct regulatory regimes, like in the default
basic setup. For small S in the regulation regime, the optimal C is given to the leading order
by:

$$C^* \approx e^{-E_a} \frac{Q}{\sqrt{S}\,\sqrt{M - Q}}. \qquad \text{(S44)}$$

The minimal crosstalk error at the optimal concentration C* is given by

$$X^* = -SQ + \frac{2Q}{M}\sqrt{S(M - Q)(1 + SQ)}. \qquad \text{(S45)}$$
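
The optimum under the alternative definition can be checked numerically. The sketch below assumes the mean-field form of the alternative-definition crosstalk, X = (Q/M) e^{-E_a}/(C/Q + e^{-E_a} + CS) + ((M − Q)/M) CS/(e^{-E_a} + CS), written in terms of the rescaled concentration c = C e^{E_a}. It compares a direct numerical minimization against the closed-form expressions X* = −SQ + (2Q/M)√(S(M − Q)(1 + SQ)) and c* ≈ Q/√(S(M − Q)). Function names and the search range are illustrative choices.

```python
import math

def crosstalk_alt(c, S, Q, M):
    """Alternative-definition crosstalk; c is the concentration in units of exp(-E_a)."""
    x1 = 1.0 / (1.0 + c / Q + c * S)   # only the unbound state is an error
    x2 = c * S / (1.0 + c * S)         # any noncognate binding is an error
    return (Q * x1 + (M - Q) * x2) / M

def closed_form(S, Q, M):
    """Leading-order optimal concentration and minimal crosstalk."""
    c_star = Q / math.sqrt(S * (M - Q))
    x_star = -S * Q + 2.0 * Q / M * math.sqrt(S * (M - Q) * (1.0 + S * Q))
    return c_star, x_star

def minimize_numerically(S, Q, M, c_max=1e7, iterations=200):
    """Ternary search for the minimum, assuming a single minimum in [0, c_max]."""
    lo, hi = 0.0, c_max
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if crosstalk_alt(m1, S, Q, M) < crosstalk_alt(m2, S, Q, M):
            hi = m2
        else:
            lo = m1
    c = 0.5 * (lo + hi)
    return c, crosstalk_alt(c, S, Q, M)
```

At M = 5000, Q = 2500 and log(S) = −10.5, the numerical minimum should reproduce the closed-form X*, while the leading-order c* is expected to deviate from the numerical optimum by a few percent.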

Figure S17: Optimal mixed strategy is increasingly better than the optimal pure strategy at intermediate
M_1 and larger S, at the border of the two regimes. Here, we plot the crosstalk gains
(X*_act − X*_mixed in the top row, or X*_rep − X*_mixed in the bottom row) from using the optimal mixed
strategy instead of the optimal pure strategy as a function of the average number of genes required,
⟨Q⟩, and S, for different M_1. For M_1 < M/2 = 2500, the optimal pure strategy is to use all activators
and for M_1 > M/2 = 2500, the optimal pure strategy is to use all repressors. Note that for
M_1 > M/2, X*_rep − X*_mixed at (⟨Q⟩, S) is equal to X*_act − X*_mixed at M_1' = M − M_1 < M/2 and
(M − ⟨Q⟩, S); they are laterally inverted mirror images. In general, the optimal mixed strategy gives
a lower crosstalk than the optimal pure strategy for intermediate M_1. At the baseline parameters
of ⟨Q⟩ = 2500, M = 5000, log(S) = −10.5, for both M_1 = 500 and 4500, the crosstalk gain is 0.03,
while for M_1 = 2000 and 3000, the crosstalk gain is 0.09. For a particular M_1, crosstalk gains are
larger both at larger S and larger (smaller) ⟨Q⟩ for M_1 > M/2 (M_1 < M/2). We obtain different ⟨Q⟩
on the x-axes as ⟨Q⟩ = P_1 M_1 + P_2 M_2 by varying (P_1, P_2) along the solid white line of Fig. S15 from
(0.5, 0) to (1, 0.5).

Figure S18: Basic model with alternative crosstalk definition also exhibits three distinct regulation
regimes. The alternative definition does not count "activation out-of-context" as an error state.
(A) Minimal crosstalk error, X*, shown in color, as a function of the number of coactivated genes
Q, and binding site similarity S. (B) Optimal TF concentration C*, that minimizes the crosstalk,
relative to C_0, the optimal concentration at the baseline parameters (see main text).


5 Estimating the binding site similarity, S 

5.1 Optimal packing 

In real organisms, binding site sequences for different genes could depart from a random distribution
(even after taking into account the statistical structure of the genomic background). For example,
to achieve high specificity of regulation, we could hypothesize that binding site sequences
evolved to minimize the overlap between any pair of consensus sequences. To explore the crosstalk
limit under such optimal use of sequence space and contrast it with the random choice of binding
sites, we synthetically constructed binding site sequences that are as distinct as possible. Specifically,
our optimal codes are described by a parameter d_min, which is the minimum required number
of basepair differences between any pair of binding site sequences, measured as the Hamming distance,
HD, between sequences. The problem of choosing M sequences of length L such that each pair
differs by at least d_min is not tractably solvable in general. We construct numerical approximations
to these optimal codes using the following algorithm:

1. Generate all possible sequences of length L and store them in a list called words. Create an 
empty list, called codewords, which will store the binding site sequences. 

2. Pick the first entry, s, from the list words, to be a binding site sequence, and append it to the 
list codewords. 

3. Erase s and all of its Hamming neighbours at distance strictly less than d_min from the list
words.

4. If the list words is not empty, repeat from step 2. If the list words is empty, stop. 
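
The four steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the original implementation; the function names and the lexicographic initial ordering of the sequence list are assumptions.

```python
from itertools import product

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def greedy_code(L, d_min, alphabet="ACGT"):
    """Greedily pick sequences of length L that are pairwise at least d_min apart."""
    # Step 1: all possible sequences, in lexicographic order.
    words = ["".join(w) for w in product(alphabet, repeat=L)]
    codewords = []
    # Step 4: repeat until no candidate sequences remain.
    while words:
        # Step 2: take the first remaining sequence as a binding site.
        s = words[0]
        codewords.append(s)
        # Step 3: erase s and every sequence closer than d_min to s.
        words = [w for w in words if w != s and hamming(w, s) >= d_min]
    return codewords
```

For instance, `greedy_code(2, 2)` returns `['AA', 'CC', 'GG', 'TT']`. A different initial ordering of the sequence list can give a different code, and possibly a different number of codewords.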

When the procedure terminates, the list codewords will contain binding site sequences that are
separated by at least d_min mismatches. The outcome of this procedure depends on the initial ordering
of the list of all possible sequences. The procedure is not guaranteed to generate the maximal
set of sequences satisfying the Hamming distance criterion. From the list of generated binding site
sequences, we obtain P(d), the distribution of mismatch distances between all pairs of binding
sites, and hence obtain the value of S as

$$S(d_{\min}) = \sum_{d \geq d_{\min}} P(d)\, e^{-\epsilon d}. \qquad \text{(S46)}$$
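
As an illustration, the pairwise mismatch distribution P(d) and the resulting similarity S can be computed from any list of binding site sequences as sketched below. The function names are hypothetical, and P(d) is taken here over all distinct pairs of sequences, which is an assumption about how pairs are counted.

```python
import math
from collections import Counter
from itertools import combinations

def mismatch_distribution(codewords):
    """P(d): fraction of distinct sequence pairs that differ at exactly d positions."""
    counts = Counter(
        sum(x != y for x, y in zip(a, b)) for a, b in combinations(codewords, 2)
    )
    n_pairs = sum(counts.values())
    return {d: n / n_pairs for d, n in counts.items()}

def similarity(P, eps, d_min=0):
    """S = sum over d >= d_min of P(d) * exp(-eps * d)."""
    return sum(p * math.exp(-eps * d) for d, p in P.items() if d >= d_min)
```

For the code `['AA', 'CC', 'GG', 'TT']`, every pair differs at both positions, so P(2) = 1 and S = e^{-2ε}.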

Figure S19: Optimal packing. This alternative model with optimal packing of binding sites in
sequence space leads to values for S (y-axis) that can be remapped to the S(ε, L) (x-axis) for the
random code with the mismatch energy model, E(d) = εd and L = 10 bp binding sites (corresponding
scale for ε shown in the top axis). Dashed lines denote equality. Optimally designed
binding sites effectively decrease S. Here, their sequences are at least d_min bp distant from each
other (gray lines = different d_min as indicated).

d_min = 0 corresponds to the "random code" and results in S(d_min = 0) = S = (1/4 + 3/4 e^{−ε})^L.
Note that increasing d_min decreases the maximum possible M as sequences move further apart
in sequence space whose size is fixed. A well-known upper bound on the number of sequences
satisfying the Hamming distance criterion is the Singleton bound [65]: M(d_min, L) ≤ 4^{L − d_min + 1}.

As shown in Fig. S20, with L = 8 and d_min = 3, we already have M ≤ 4096. With L = 10 and
d_min = 4, we have M ≤ 16384. As L becomes smaller, the possible range of M also decreases. This
suggests that prokaryotes are capable of having optimally packed binding site sequences, because
they typically have L > 10 and M < 10^. On the other hand, eukaryotes have smaller L and larger 
M and might not have enough sequence space to pack it optimally. 
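
The two coding-theory bounds used here are straightforward to evaluate. The sketch below computes the Singleton upper bound, q^{L − d_min + 1}, and the standard Gilbert-Varshamov lower bound, q^L / Σ_{j<d_min} C(L, j)(q − 1)^j, for a q = 4 letter alphabet. The function names are illustrative, and both functions assume d_min ≥ 1.

```python
import math

def singleton_bound(L, d_min, q=4):
    """Upper bound on the number of length-L sequences pairwise >= d_min apart."""
    if d_min < 1:
        raise ValueError("d_min must be at least 1")
    return q ** (L - d_min + 1)

def gilbert_varshamov_bound(L, d_min, q=4):
    """Lower bound on the size of the largest code with minimum distance d_min."""
    if d_min < 1:
        raise ValueError("d_min must be at least 1")
    # Number of sequences within Hamming distance d_min - 1 of a given sequence.
    ball = sum(math.comb(L, j) * (q - 1) ** j for j in range(d_min))
    return -(-(q ** L) // ball)   # integer ceiling of q^L / ball
```

`singleton_bound(8, 3)` gives 4096 and `singleton_bound(10, 4)` gives 16384, matching the values quoted in the text.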

5.2 Reverse complemented sequences 

We have also considered a different definition of distance between sequences that takes the
double-stranded nature of DNA into account. This brings the reverse complements of both sequences
in question into the picture. If s_i and s_j are two sequences with reverse complements r_i and r_j
respectively, this new definition of Hamming distance is

$$HD_{rc}(s_i, s_j) = \min\left[ HD(s_i, s_j),\, HD(r_i, s_j),\, HD(s_i, r_j),\, HD(r_i, r_j) \right], \qquad \text{(S47)}$$

where HD(s_i, s_j) is the usual Hamming distance as considered previously. This restricts the
sequence space much more than the usual definition and as such, as seen in Fig. S20, we
can pack fewer binding sites in the sequence space at a specific d_min. Given that there are enough
sequences under the HD_rc measure in the sequence space, we can also ask how S changes relative
to the random code. Intuitively, S should increase since each binding site sequence also contributes
its reverse complement into the pool of sequences to which TFs can bind non-cognately. Indeed,
Fig. S21, which maps S from the reverse complement code to S from a random code, shows that S
increases.



Figure S20: Bounds on the maximal number of binding site sequences for different d_min with
binding sites of length L = 8. Two bounds from coding theory (the Singleton upper bound and the
Gilbert-Varshamov (GV) lower bound [65]) are shown together with the values of M obtained by
our numerical approximation procedure. These are shown both for the usual definition of distance
between sequences as the Hamming distance, HD, as well as for a definition that considers the
reverse complements of the sequences, HD_rc. For d_min = 0 there are M = 4^8 ≈ 65000 possible
sequences where all sequence pairs are at least d_min distant from each other, but the number quickly
decreases with increasing d_min. Going from HD to HD_rc, the Singleton bound doesn't change from
the usual situation but the Gilbert-Varshamov (GV) bound, which takes into account the "volume
of restricted ball" around each sequence, goes down. Because of stronger constraints, the number
of sequences that can be packed goes down from the usual situation but only by a factor of ~2.

Figure S21: Reverse complemented sequences. Using an alternative definition of distance (HD_rc)
between binding site sequences, which takes into account the double-stranded nature of DNA by
also considering the reverse complements of the sequences in question, leads to values for S
(y-axis) that can be remapped to the S(ε, L) (x-axis) for the random code with the usual Hamming
distance definition, HD. Here, we have considered L = 8 bp binding sites (corresponding scale
for ε shown in the top axis). Dashed lines denote equality. This alternative definition increases S
because more sequences are now found in the "shells" around the consensus to which the TF can
bind on the reverse strand. S increases by about a factor of 2.
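
A reverse-complement-aware distance of this kind can be sketched as follows: it takes the minimum Hamming distance over the four combinations of each sequence and its reverse complement. The function names are illustrative.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(s):
    """Reverse complement of a DNA sequence written in the ACGT alphabet."""
    return s.translate(COMPLEMENT)[::-1]

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def hd_rc(s_i, s_j):
    """Minimum Hamming distance over both strands of both sequences."""
    r_i, r_j = reverse_complement(s_i), reverse_complement(s_j)
    return min(
        hamming(s_i, s_j),
        hamming(r_i, s_j),
        hamming(s_i, r_j),
        hamming(r_i, r_j),
    )
```

For example, "AACG" and "CGTT" differ at all four positions under the usual Hamming distance, but they are reverse complements of each other, so `hd_rc("AACG", "CGTT")` is 0.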

5.3 Saturating model of TF-DNA binding energy 

It has been experimentally observed that the binding energy between TF and DNA saturates to 
some nonspecific value after a certain number of mismatches between the TF's cognate sequence 
and the DNA sequence in question IfT^ . We consider such a saturating energy model, characterized 
by a parameter d_0, the number of mismatches after which binding energy saturates. The binding
energy is given by E(d) = ε min(d, d_0). We obtain S as

$$S(d_0) = \sum_d P(d)\, e^{-\epsilon \min(d, d_0)}, \qquad \text{(S48)}$$

where P(d) is the distribution of mismatch distances between all pairs of binding sites picked
at random from the sequence space. d_0 = L corresponds to a mismatch model with non-saturating
energy. Decreasing d_0 limits the specificity of the TF towards binding site sequences far away from
the consensus and thereby increases S(d_0).
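
For randomly chosen sequences, the number of mismatches d between two sites of length L is binomially distributed with mismatch probability 3/4 per position, so the saturating-energy similarity can be evaluated directly. The sketch below assumes this binomial P(d); for d_0 = L it should reduce to the random-code expression (1/4 + 3/4 e^{−ε})^L. Function names are illustrative.

```python
import math

def mismatch_pmf(L):
    """Binomial P(d) for two random length-L sequences over a 4-letter alphabet."""
    return [math.comb(L, d) * 0.75 ** d * 0.25 ** (L - d) for d in range(L + 1)]

def similarity_saturating(eps, L, d0):
    """S(d0) = sum_d P(d) * exp(-eps * min(d, d0))."""
    return sum(
        p * math.exp(-eps * min(d, d0)) for d, p in enumerate(mismatch_pmf(L))
    )
```

Lowering d_0 below L caps the energy penalty of distant sequences and therefore increases S, as stated above.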

Figure S22: Saturating energy model. An improved affinity model where the mismatch energy
saturates after d_0 mismatches, E(d) = ε min(d, d_0) (gray lines = different d_0 as indicated), effectively
increases S', do ~ 4 has been reported experimentally Ifl^ . This alternative model leads to values 
for S (y-axis) that can be remapped to the S(ε, L) (x-axis) for the random code with the mismatch
energy model, E(d) = εd and L = 10 bp binding sites (corresponding scale for ε shown in the top
axis). Dashed lines denote equality.

5.4 Empirical values 

We obtain organism-specific estimates of S from known databases [66, 67, 68] of the binding site
sequences of different TFs. In the main text, for a particular genome, we defined S for a collection
of TFs with the same mismatch penalty ε and binding sites of a specific constant length L. In
real organisms, different TFs have different ε and L, making it difficult to directly calculate S for a
genome. Instead we obtain a value of S for each TF by defining it as the value of S of a hypothetical
genome in which all TFs have the same binding site properties (ε, L) as our TF. Hence, for each
organism, we obtain a set of S values.

Many databases document the binding site sequences of TFs in Position Count Matrices (PCMs).
The PCM of a TF with a binding site of length L is a 4 × L matrix B with b_ij denoting the number
of known TF binding site sequences that have nucleotide i in position j. One can obtain estimates
of ε and L from B, and use them to calculate S. There are two broad ways to estimate ε and L
(and hence, S) of a TF: (a) the information method and (b) the pseudo-count method. In (a), we calculate
the information contained in the whole binding site motif and obtain an ε that distributes this
information uniformly among all sites in an equivalent "effective" motif that has the same length
as the original, but only has 0 or ε mismatch energy values. In (b), we obtain ε for all entries of
the PCM and calculate an average ε from these entries. To handle zeros in the PCM, which lead
to undefined ε, (b) uses an arbitrary pseudo-count. Method (a) can, in contrast, avoid the use
of pseudo-counts and, additionally, reproduces by construction the information content of each
known motif, which is the key statistical property of TF specificity iri4l[69ll . Hence, we used (a) to 
infer S values. In both methods, we used PCMs that have been constructed from at
least 10 distinct binding site sequences.

5.4.1 Information method 

In this method, we first obtain the binding site length L and the total information I contained
in the binding site sequences of the TF,

$$I = \sum_j I_j = \sum_j \sum_i p_{ij} \log_2 \frac{p_{ij}}{q_{ij}}, \qquad \text{(S49)}$$

where I_j is the information contained in position j, p_ij is the frequency of nucleotide i in position
j, obtained in a straightforward way from B, and q_ij is the expected background frequency. To
get rid of non-specific positions, we neglect all positions that contain information less than a certain
threshold (I_j > 0.2 bits for position j to be considered part of the binding site). For a random
genome, q_ij = 0.25 ∀ i, j, resulting in

$$I = 2L + \sum_{i,j} p_{ij} \log_2 p_{ij}. \qquad \text{(S50)}$$

The maximum information in the motif is 2L bits (when ε → ∞) with each position contributing a
maximum of 2 bits, which for finite ε is reduced by an entropy term. Obtaining the information per
position I_pos = I/L, we infer an ε that uniformly distributes the information in the motif among
individual positions. At a specific position j*, without loss of generality, assume that i = 4 has the
best binding energy (= 0). The probability of observing i = 4 at j* is given by p_4 = 1/Z, while the
probability of observing any of the three other possible nucleotides is given by p_{1,2,3} = e^{−ε}/Z, with
Z = 1 + 3e^{−ε} [42]. Hence,

$$I_{\rm pos} = 2 + \sum_i p_i \log_2 p_i \qquad \text{(S51)}$$

$$= 2 - \frac{1}{Z}\log_2 Z - \frac{3e^{-\epsilon}}{Z}\log_2\left(Z e^{\epsilon}\right) \qquad \text{(S52)}$$

$$= 2 - \log_2 Z + \frac{3e^{-\epsilon}}{Z}\log_2 e^{-\epsilon}. \qquad \text{(S53)}$$

The mismatch energy ε can be obtained from the above expression, and from ε and L, we obtain
S(ε, L) = (1/4 + 3/4 e^{−ε})^L.
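
The information method can be sketched as below, assuming a uniform background frequency of 0.25 and the single-parameter position model p_4 = 1/Z, p_{1,2,3} = e^{−ε}/Z with Z = 1 + 3e^{−ε}. The per-position information is inverted for ε by bisection, which relies on it increasing monotonically from 0 bits at ε = 0 towards 2 bits. Function names, the PCM layout (a list of columns with four counts each), and taking L as the number of positions above the threshold are assumptions for illustration.

```python
import math

def column_information(counts):
    """Information (bits) of one PCM column against a uniform background."""
    total = sum(counts)
    info = 2.0
    for b in counts:
        if b > 0:                      # 0 * log2(0) is taken as 0
            p = b / total
            info += p * math.log2(p)
    return info

def info_per_position(eps):
    """I_pos(eps) = 2 - log2(Z) - 3 eps exp(-eps) / (Z ln 2), Z = 1 + 3 exp(-eps)."""
    z = 1.0 + 3.0 * math.exp(-eps)
    return 2.0 - math.log2(z) - 3.0 * eps * math.exp(-eps) / (z * math.log(2.0))

def solve_eps(i_pos, hi=50.0, iterations=200):
    """Invert I_pos(eps) by bisection on [0, hi]."""
    lo = 0.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if info_per_position(mid) < i_pos:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def similarity_from_pcm(pcm, threshold=0.2):
    """Return (eps, L, S) for a PCM given as a list of 4-count columns."""
    kept = [i for i in map(column_information, pcm) if i > threshold]
    if not kept:
        raise ValueError("no position exceeds the information threshold")
    L = len(kept)
    eps = solve_eps(sum(kept) / L)
    return eps, L, (0.25 + 0.75 * math.exp(-eps)) ** L
```

A column with counts (1, 1, 1, 17) has exactly the model form with e^{−ε} = 1/17, so a motif built from such columns should yield ε = ln 17.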

5.4.2 Pseudo-count method 

In this method, we infer ε for all three non-cognate nucleotides in each position, and obtain ε for
the TF as an average of these 3L values. For an arbitrary position j, as before, assume that i = 4
has the maximum counts (b_4j > b_ij, i = 1, 2, 3). We obtain ε_ij = log(b_4j / b_ij) and the mismatch
penalty for position j as ε_j = (1/3)(ε_1j + ε_2j + ε_3j). If some entry b_kj = 0, ε_kj is undefined. To take
care of this, we first add a pseudocount δ to all entries of B and obtain a modified PCM B_δ to infer ε.
The value of δ chosen is arbitrary and it is common practice to use δ = 0.5 or δ = 1. As before, to get
rid of non-specific positions, we consider positions that have ε_j > 1. From the remaining, we take a
mean to obtain ε = ⟨ε_j⟩_j, and finally obtain S(ε, L) = (1/4 + 3/4 e^{−ε})^L.
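
A minimal sketch of the pseudo-count estimate is given below. It assumes natural logarithms for the count ratios, a PCM given as a list of columns with four counts each, and L taken as the number of positions that pass the ε_j > 1 filter; the function name and these conventions are illustrative assumptions.

```python
import math

def similarity_pseudocount(pcm, delta=0.5, min_eps=1.0):
    """Return (eps, L, S) from a PCM using a pseudocount delta on every entry."""
    eps_positions = []
    for column in pcm:
        # Add the pseudocount; with delta = 0, zero counts give a division error.
        counts = sorted((b + delta for b in column), reverse=True)
        best = counts[0]
        # Average log-ratio of the best nucleotide to the other three.
        eps_j = sum(math.log(best / b) for b in counts[1:]) / 3.0
        if eps_j > min_eps:            # drop non-specific positions
            eps_positions.append(eps_j)
    if not eps_positions:
        raise ValueError("no position passes the specificity filter")
    L = len(eps_positions)
    eps = sum(eps_positions) / L
    return eps, L, (0.25 + 0.75 * math.exp(-eps)) ** L
```

For a motif with six columns of counts (1, 1, 1, 17) and two uniform columns, δ = 0 gives ε = ln 17 over L = 6 positions, while δ = 0.5 lowers the estimate to ln(17.5/1.5).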

Figure S23: Boxplots of S for TFs from different databases. In each panel, organism-specific
(from a single database) boxplots of S are shown. The first boxplot in each panel corresponds
to S values obtained from information estimates, and the remaining four correspond to S values
obtained using the pseudo-count method with δ = 0, 0.1, 0.5, 1 from left to right. E. coli TFs were
obtained from RegulonDB [66] and yeast (S. cerevisiae) TFs from two different databases, scerTF [68]
and JASPAR [67]. All the other organism-specific TFs were obtained from JASPAR. Notice that in
the pseudo-count method, δ has the biggest influence on the estimates in E. coli. Importantly, for
all other organisms, the estimates are invariant to δ and agree well with the information estimate.


6 Cooperative regulation 

So far, we assumed a single binding site for every gene. Yet, some genes employ combinatorial 
regulation, with several binding sites regulated by a number of transcription factors. As a next step 
in extending our model we consider cooperative regulation, where every gene has two binding 
sites that are bound by two copies of the same type of transcription factor. 

We assume 2 binding sites per gene, with energy gap E_a between cognate-bound and unbound
states. An additional energy contribution Δ is obtained if both sites are bound by cognate factors,
which then interact with each other. We also consider the configuration in which two noncognate
factors of the same type bind to the double binding sites and interact with each other as well. In the
limit that Δ ≫ E_a, once one of the sites is bound, the binding of the other becomes energetically
favorable. This cooperative binding energy only applies for two molecules of the same type. Thus,
if one site is bound by the cognate and the other by a noncognate molecule, cooperative interaction
doesn't apply. We assume that binding at only one of the two sites induces transcription. The
reasoning for this assumption is that for many bacterial and yeast genes activators are thought to work
by recruiting the transcriptional machinery to the DNA [70]. Following this rationale, only one of
the two sites is in the correct physical location (in bacteria, the proximal one) to do so successfully.
Technically, if we assume that only one of the two sites determines transcription, for Δ = 0 the
cooperativity case reduces back to the basic model (Section 1). We list the possible binding
configurations of the two sites, their energies and statistical weights in Table 4.

 #   Config.   Activity   Crosstalk   Crosstalk   Strong          Energy               Weight
                          if ON       if OFF      cooperativity
 1   CC        ON         -                       +               0                    (C/Q)^2
 2   UC        ON         -                                       E_a + Δ              (C/Q) e^{-E_a-Δ}
 3   NC        ON         -                                       Δ + εd               (C^2/Q) S e^{-Δ}
 4   UU        OFF        +           -           +               2E_a + Δ             e^{-2E_a-Δ}
 5   CU        OFF        +           -                           E_a + Δ              (C/Q) e^{-E_a-Δ}
 6   NU        OFF        +           -                           E_a + Δ + εd         C S e^{-E_a-Δ}
 7   UN        *          +           +                           E_a + Δ + εd         C S e^{-E_a-Δ}
 8   CN        *          +                                       Δ + εd               (C^2/Q) S e^{-Δ}
 9   N_xN_y    *          +           +                           Δ + ε(d_1 + d_2)     C^2 S^2 e^{-Δ}
10   N_xN_x    *          +           +           +               2εd                  (C^2/Q) S(2ε, L)

Table 4: All possible binding configurations and the corresponding energies for a two-binding-site
model with cooperative interaction. 'C' denotes binding by the cognate factor, 'N' binding by a
noncognate factor, and 'U' means that the site is unbound. We distinguish between binding of
noncognate molecules of the same type (N_xN_x) and of different types (N_xN_y), where in the former there is
also cooperative interaction between the molecules. We define the reference energy level E = 0
as the state 'CC', in which both sites are bound by cognate factors with cooperative interaction, such
that all other energies are positive. We assume that the left binding site is the auxiliary one and only the
right one determines the state of activity. Note that the statistical weight of the last binding
configuration, N_xN_x, uses S(2ε, L) instead of the otherwise used S(ε, L). The column 'Activity' denotes whether
in the given configuration the gene is ON, OFF, or * (could be either active or inactive, possibly
active in response to a noncognate signal). Blank space denotes a non-existing configuration
(or one which is not accounted for): these are the configurations including a cognate factor bound
in the situation that it is absent because the gene should be silent. The next two columns denote
whether this configuration was counted as crosstalk (+) or not (-) if the cognate transcription factor
is present and the gene should be activated, or if it is absent (and the gene should be silenced).
The 'Strong cooperativity' column denotes the configurations included under the strong cooperativity
approximation.

The general case of this model, incorporating all possible binding configurations, yields a 6th-order
equation in the TF concentration C, which we only handle numerically. The following limiting
cases are, however, analytically solvable:

1. Limit of strong cooperativity: Assume that the cooperative interaction is strong compared to
the individual protein-DNA binding energies E_a. We can then neglect binding configurations
in which only one of the sites is bound and the other is vacant, and the ones in which
both are bound, but by molecules that do not interact cooperatively. That leaves us with only
3 possible binding configurations: both sites unbound, both bound by cognate TF, or both
bound by noncognate TF molecules of the same type with cooperative interaction (configurations
1, 4 and 10 in Table 4). By a proper change of variables this case can be reduced back to
the basic single-binding-site model. The minimal crosstalk then reads:

$$X^*_{\rm coop} = \frac{Q}{M}\left(-\tilde{S}(M - Q) + 2\sqrt{\tilde{S}(M - Q)}\right), \qquad \text{(S54)}$$
where S̃ = S(2ε, L). This error is achievable with TF concentration
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'sQiM -Q) + M-2Q"j 

+ ^S{M-Q)^ 


(S55) 


Since the cooperative binding model allows for a binding site which is twice as long and
a higher total binding energy, the parameters need to be correctly transformed to compare to
the 1-site model. If we transform S̃ → S, we obtain exactly the same minimal error as in
the single-site model. By proper transformation of the energy of the unbound state, Ẽ_a =
Δ + 2E_a, the TF concentration that minimizes the error is a square root of the one we had
in the single-site model, Eq. (S11). As in the basic single-site model, here too we
obtain different parameter regimes: for S̃ = S(2ε, L) > 1/(M − Q), the minimal error
is obtained by taking C = 0, namely regulation is not advantageous. While
cooperative binding seemingly is equivalent to a 1-site model with a binding site twice as long, this
is not accurate. The reason is that cooperative interaction occurs only between two specific
molecules, which limits the possible sequence space.


2. Limit of weak cooperativity: If $\Delta = 0$, the problem reduces to the basic single-site model.
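The reduction of the strong-cooperativity limit to the single-site model can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, the closed-form optimum is written in a compact form that is algebraically equivalent to the single-site result, and the pair weights $(C/Q)^2$ (cognate pair) and $(C^2/Q)\tilde{S}$ (same-type noncognate pair) are an assumption about how the change of variables is realized. The value used for $\tilde{S}$ is a placeholder.

```python
import math

def basic_crosstalk(C, S, Q, M, Ea):
    """Crosstalk X(C) of the single-site activation model."""
    u = math.exp(-Ea)                       # weight of the unbound state
    x1 = (u + C * S) / (C / Q + u + C * S)  # gene should be ON but cognate TF is not bound
    x2 = (C * S) / (C * S + u)              # gene should be OFF but a noncognate TF is bound
    return (Q / M) * x1 + (1 - Q / M) * x2

def basic_optimum(S, Q, M, Ea):
    """Closed-form (C*, X*) in the regulation regime S*(M-Q) < 1."""
    r = math.sqrt(S * (M - Q))
    c_star = math.exp(-Ea) * Q * (1 - r) / (r * (1 + S * Q) - S * Q)
    x_star = (Q / M) * (2 * r - r * r)
    return c_star, x_star

def coop_crosstalk(C, S2, Q, M, Ea, delta):
    """Strong cooperativity (assumed weights): same form with C -> C^2/Q, S -> S2, Ea -> delta + 2*Ea."""
    return basic_crosstalk(C * C / Q, S2, Q, M, delta + 2 * Ea)

def coop_optimum(S2, Q, M, Ea, delta):
    """Optimal concentration is the square root of Q times the single-site optimum."""
    c_single, x_star = basic_optimum(S2, Q, M, delta + 2 * Ea)
    return math.sqrt(Q * c_single), x_star

if __name__ == "__main__":
    S, Q, M, Ea = math.exp(-10.5), 2500, 5000, 15.0
    print("single site:", basic_optimum(S, Q, M, Ea))
    print("cooperative:", coop_optimum(1e-6, Q, M, Ea, delta=5.0))  # S2 = 1e-6 is a placeholder
```

With the baseline parameters $Q = 2500$, $M = 5000$ and $\log(S) = -10.5$, the single-site closed form evaluates to a minimal crosstalk of about 0.23.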


6.1 Cooperativity with interactions between noncognate pairs 

In Fig. 4 of the main text we neglected the possibility of cooperative interaction between pairs of
noncognate molecules at the binding site of interest. This assumption is plausible if the interaction
between the molecules is facilitated by the specific binding sites. However, the molecules can also
cooperatively interact in solution before binding and then bind a noncognate site as a complex.
This possibility was not taken into account in Fig. 4 (main text). In the following we repeat the
calculation including this interaction too (state no. 10 in Table 4). The results are illustrated in Fig. S24.
Evidently, the improvement in crosstalk owing to cooperativity is now significantly smaller.


7 Combinatorial regulation (AND gate) 

So far, we have been dealing with models in which each gene is regulated by a single type of TF,
be it a single activator, a single repressor, or multiple TFs of the same type using cooperative
interactions. Here, we consider a simple model of combinatorial regulation by a combination
of two activators of different types, and compute the optimal crosstalk for this setup as a function of
the parameters of interest.

As before, we have $M$ genes in total, with each gene having two binding sites, corresponding to
two different (cognate) TF types. For a particular gene to be ON, we need the presence of both cognate
TF types, which need to occupy both binding sites. This regulatory architecture corresponds
to an AND gate. We do not specify how this AND gate is implemented on the molecular level. Unlike
in cooperative regulation, no additional energy gain is assumed here due to the interaction
between the two TFs when bound to the DNA.

Each TF can pair with various other TFs in regulating a particular gene. In the basic activation
setup, the total number of TFs, $M$, was equal to the total number of genes. In the combinatorial
regulation setup, which is an extension of the basic activation setup, the total number of genes $M$
will be equal to the total number of different TF-TF combinations that can exist. This will depend
Figure S24: Crosstalk when any pair of the same type TFs interacts cooperatively, even if bound
to a noncognate site. Here we repeat the calculation of Fig. 4 of the main text, where we also account
for cooperative interaction between the noncognate binders. This significantly decreases the benefit
of cooperative interaction, although it still shows some improvement compared to the single-site
basic model. (a): Difference in crosstalk compared to the basic model with a single site,
$X^*_{\mathrm{coop}} - X^*$, where the strength of the cooperative interaction is $\Delta = 10$. One outcome
of this is that the $C^* = 0$ (no regulation) regime becomes significantly larger (compare to Fig. 4B).
(b): Minimal crosstalk obtained for different intensities of cooperative interaction. In contrast to the
case shown in the main text Fig. 4C, where increased cooperativity always reduces crosstalk, here the
improvement is limited. For example, increasing cooperativity from $\Delta = 5$ to $\Delta = 10$ brings
about only a minor improvement. (c): Optimal TF concentration decreases with increased cooperativity,
as in Fig. 4D. Circles denote the transition to the $C^* = 0$ (no regulation) regime.


on the extent of combinatorial regulation, which we quantify using $f$, the fraction of TF-TF
combinations each TF type realizes out of the theoretically maximal number of pairwise combinations
it could have.

If there are $T$ TFs in total, each TF can potentially pair with $N_{\mathrm{int}} \approx fT$ other TF types,
where $f$ is the fraction of pairs each TF type realizes. This gives us $M = T N_{\mathrm{int}}/2$, and thus
$T \approx \sqrt{2M/f}$ and $N_{\mathrm{int}} \approx \sqrt{2Mf}$. But each TF should pair with at least one
other TF, so we require $N_{\mathrm{int}} \geq 1$. Taking both of these limits into account, we have, for
$N_{\mathrm{int}}$, the number of TFs each TF pairs with, and the number of total TFs $T$,

$$N_{\mathrm{int}} = \max\left(1, \sqrt{2Mf}\right), \tag{S56}$$

$$T = \frac{2M}{N_{\mathrm{int}}}. \tag{S57}$$

If each TF pairs with all other TFs, we have $f = 1$ and $N_{\mathrm{int}} = T - 1$, which gives us
$T \approx \sqrt{2M}$. We call this "perfect combinatorial regulation" because it minimizes the number of
TFs needed to regulate a certain number of genes.

If each TF realizes only a fraction $1/(2M) < f < 1$ of its combinations, we have $N_{\mathrm{int}} > 1$ pairs
for each TF, which gives us $T \approx \sqrt{2M/f}$. We call this "imperfect combinatorial regulation".

If $f < 1/(2M)$, we have $N_{\mathrm{int}} = 1$, which gives us $T = 2M$. We call this "worst combinatorial
regulation".

As before, we will compute the optimal crosstalk when $Q$ genes are required to be ON. Here,
we compute the "typical" number of TFs present at any one time, $t$, by following a similar recipe
as before. We have $Q = t\,n_{\mathrm{int}}/2$, where $n_{\mathrm{int}}$ is the number of pairs per TF present at
any one time. This will be smaller, as there are fewer TFs present at any given time relative to the
total number of TF types, i.e., $t < T$. As before, we have

$$n_{\mathrm{int}} = \max\left(1, \sqrt{2Qf}\right), \tag{S58}$$

$$t = \frac{2Q}{n_{\mathrm{int}}}. \tag{S59}$$

When $f > 1/(2Q)$, we have $t = \sqrt{2Q/f}$, and when $f < 1/(2Q)$, we have $t = 2Q$.
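The counting rules for the number of partners per TF and the number of TF types can be evaluated directly. The following is a minimal sketch; the helper name `tf_counts` is hypothetical, and the same function is applied once to $(M, f)$ to obtain $(N_{\mathrm{int}}, T)$ and once to $(Q, f)$ to obtain $(n_{\mathrm{int}}, t)$.

```python
import math

def tf_counts(n_genes, f):
    """Return (partners per TF, number of TF types) for n_genes gene pairs.

    partners = max(1, sqrt(2 * n_genes * f)); types = 2 * n_genes / partners.
    """
    partners = max(1.0, math.sqrt(2 * n_genes * f))
    return partners, 2 * n_genes / partners

if __name__ == "__main__":
    M, Q = 5000, 2500
    for f in (1.0, 0.1, 0.01, 0.001):
        N_int, T = tf_counts(M, f)   # all TF types in the genome
        n_int, t = tf_counts(Q, f)   # TF types present at any one time
        print(f"f={f}: N_int={N_int:.2f}, T={T:.1f}, n_int={n_int:.2f}, t={t:.1f}")
```

For $M = 5000$ and $f = 1$ this gives $N_{\mathrm{int}} = T = 100$, whereas for very small $f$ it falls back to one partner per TF and $T = 2M$.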

Unlike in the basic activation setup, the $Q$ genes that are required to be ON have both cognate TFs
present, but genes that are required to be OFF have either none of the cognate types present or
one (but not both) of the TF types present. As calculated above, we have $t$ TFs and each TF has
$n_{\mathrm{int}}$ combinations, while the total number of combinations it can have is $N_{\mathrm{int}}$; each TF
that is present therefore has $N_{\mathrm{int}} - n_{\mathrm{int}}$ missing combinations. The number of genes
(that should be OFF) which have only one TF present can be obtained as

$$Q_1 = \frac{t\left(N_{\mathrm{int}} - n_{\mathrm{int}}\right)}{2}. \tag{S60}$$

The number of genes with no cognate TFs present is $Q_0 = M - Q - Q_1$. In Table 5 we have
listed all possible configurations for the two binding sites of a gene, along with details of crosstalk
states and statistical weights. From this, we get the per-gene crosstalk for different types of genes.
For genes that have both cognate TFs present ($Q$ out of $M$), the per-gene crosstalk error is


$$x_{\mathrm{both}} = 1 - \frac{(C/t)^2}{(C/t)^2 + 2e^{-E_a}(C/t) + 2(C/t)CS + 2e^{-E_a}CS + (CS)^2 + (C/t)\,C\,S(2\epsilon, L) + e^{-2E_a}}. \tag{S61}$$


For genes that have only one of the two cognate TFs present ($Q_1$ out of $M$ genes), the per-gene
crosstalk error is

$$x_{\mathrm{one}} = \frac{(C/t)CS + (CS)^2 + (C/t)\,C\,S(2\epsilon, L)}{e^{-E_a}(C/t) + (C/t)CS + 2e^{-E_a}CS + (CS)^2 + (C/t)\,C\,S(2\epsilon, L) + e^{-2E_a}}. \tag{S62}$$
Table 5: All possible binding configurations and the corresponding energies for a combinatorial
regulation setup implementing an AND gate. Each gene has two binding sites which bind two
different cognate TF types. The "configuration" column lists all the configurations of the two
binding sites of a gene. 'C' denotes binding by a cognate factor, 'N' binding by a noncognate factor,
and 'U' means that the site is unbound. We distinguish between binding of noncognate molecules of the
same type ($N_xN_x$) and of different types ($N_xN_y$). The "activity" column denotes whether in the given
configuration the gene is ON or OFF. To implement the AND gate, we assume that transcription
occurs (ON) only when both binding sites are bound. The next four columns denote
whether this configuration is counted as crosstalk (+) or not (-). In the leftmost column, "ON", both
cognate transcription factors are present (and the gene should be ON). In the next three "OFF"
columns, at least one of the cognate TFs is absent (and the gene should be OFF). In the "C can be X"
column, the cognate TF of only the left binding site (X) is present; in "C can be Y", the cognate TF
of only the right binding site is present; and in the "C can be none" column, both cognate TFs are
absent. Blank space denotes a non-existing configuration: these are the configurations that include a
bound cognate factor in the situation where it is absent. The column "Energy" specifies the energy of
these configurations. We define the reference energy level $E = 0$ as the state 'CC', when both sites
are bound by their cognate factors, such that all other energies are positive. The column "Weight"
denotes the statistical weight of the configurations, taking into account the concentrations of the
relevant TFs and the energy of the configurations. Note that the statistical weight of the last binding
configuration uses $S(2\epsilon, L)$ instead of the usual $S(\epsilon, L)$.
For genes that do not have any of their two cognate TFs present ($M - Q - Q_1$ out of $M$ genes),
the per-gene crosstalk error is

$$x_{\mathrm{none}} = \frac{(CS)^2 + (C/t)\,C\,S(2\epsilon, L)}{2e^{-E_a}CS + (CS)^2 + (C/t)\,C\,S(2\epsilon, L) + e^{-2E_a}}. \tag{S63}$$

The total crosstalk is:

$$X_{\mathrm{comb}} = \frac{Q}{M}\,x_{\mathrm{both}} + \frac{Q_1}{M}\,x_{\mathrm{one}} + \left(1 - \frac{Q + Q_1}{M}\right)x_{\mathrm{none}}. \tag{S64}$$


For a given $M$ and $f$ and for each $(Q, S)$ pair, we compute the optimal concentration $C^*$
numerically, and obtain the minimal crosstalk $X^*_{\mathrm{comb}}$.
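The numerical minimization can be sketched as follows. This is not the authors' code: the function names are hypothetical, the optimizer is a plain log-spaced grid search, and the same-type pair term $S(2\epsilon, L)$ is passed in as a free parameter `S2` whose value below is only a placeholder. The per-gene errors follow the three expressions for genes with both, one, or none of their cognate TFs present, combined with weights $Q/M$, $Q_1/M$ and $1 - (Q + Q_1)/M$.

```python
import math

def counts(n, f):
    """Partners per TF and number of TF types, as in the max(1, sqrt(2nf)) rule."""
    k = max(1.0, math.sqrt(2 * n * f))
    return k, 2 * n / k

def comb_crosstalk(C, S, S2, Q, M, f, Ea):
    """Total crosstalk of the AND-gate model at total TF concentration C."""
    N_int, _T = counts(M, f)
    n_int, t = counts(Q, f)
    Q1 = t * (N_int - n_int) / 2          # OFF genes with exactly one cognate TF present
    u = math.exp(-Ea)                     # one unbound site
    c = C / t                             # concentration of one present TF type
    cs = C * S                            # noncognate binding to one site
    both_nc = cs ** 2 + c * C * S2        # both sites bound by noncognate TFs
    x_both = 1 - c ** 2 / (c ** 2 + 2 * u * c + 2 * c * cs + 2 * u * cs + both_nc + u ** 2)
    x_one = (c * cs + both_nc) / (u * c + c * cs + 2 * u * cs + both_nc + u ** 2)
    x_none = both_nc / (2 * u * cs + both_nc + u ** 2)
    return (Q / M) * x_both + (Q1 / M) * x_one + (1 - (Q + Q1) / M) * x_none

def minimal_comb_crosstalk(S, S2, Q, M, f, Ea, n_grid=4000):
    """Grid search over C in [1e-12, 1e4]; returns (minimal crosstalk, argmin C)."""
    best_x, best_c = float("inf"), None
    for i in range(n_grid + 1):
        C = 10 ** (-12 + 16 * i / n_grid)
        x = comb_crosstalk(C, S, S2, Q, M, f, Ea)
        if x < best_x:
            best_x, best_c = x, C
    return best_x, best_c

if __name__ == "__main__":
    S = math.exp(-10.5)
    for f in (0.001, 0.01, 0.1, 1.0):
        # S2 = 1e-7 is a placeholder, not a value taken from the text
        print(f, minimal_comb_crosstalk(S, 1e-7, 2500, 5000, f, 15.0))
```

Because `S2` is a placeholder, the printed numbers are not expected to reproduce the values quoted in the text; the sketch only illustrates the structure of the calculation.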

As plotted in Fig. S25, the boundaries between different regimes shift in the combinatorial setup.
In particular, while at small $f$ the "regulation regime" shrinks in the $(Q, S)$ plane, as $f$ increases, it
expands. As $f$ increases towards 1, the boundary between the "regulation regime" and the "$C = 0$"
regime moves towards larger $S$. In Fig. S26 we have plotted the difference in optimal crosstalk
between combinatorial regulation and the basic activation setup. For $f = 0.001$, combinatorial
regulation does not improve on the basic activation setup in terms of optimal crosstalk. But for
$f = 0.01, 0.1,$ and $1$, combinatorial regulation gives a lower optimal crosstalk than the basic activation
setup. So, there exists a threshold in $f$ such that for combinatorial regulation below that threshold,
the "regulation regime" shrinks in comparison to the basic activation setup and performs worse.
Above the threshold, the "regulation regime" expands towards larger $S$ and gives a lower optimal
crosstalk than the basic activation setup. At the baseline parameters of $Q = 2500$, $M = 5000$ and
$\log(S) = -10.5$, optimal crosstalk for the combinatorial setups reads $X^*_{\mathrm{comb}} = 0.28, 0.18, 0.11$
and $0.07$ for $f = 0.001, 0.01, 0.1$ and $1$ respectively, compared to $X^* = 0.23$ for the basic activation
setup.

This decrease in crosstalk is consistent with the reduction in the number of regulatory components
($T$ and $t$, the number of TFs, see Fig. S27), as discussed in SI Section 1.5. In the case of perfect
combinatorial regulation ($f = 1$), we have roughly $\sqrt{2M}$ instead of $M$ TF species in the basic
activation setup, which is a significant reduction in the number of regulatory components. Hence,
each TF now effectively controls $\Theta = M/\sqrt{2M} = \sqrt{M/2}$ genes, and so the decrease in crosstalk
is expected to be roughly $\sqrt{\Theta}$-fold compared to the basic activation setup. For $M = 5000$ genes, this
would suggest that perfect combinatorial regulation could decrease the crosstalk by ~7-fold over
the basic model. The actual reduction in crosstalk (from 0.23 to 0.07, roughly 3-fold) is not as large
because of certain differences between the combinatorial setup and the $\Theta$-genes setup of SI Section 1.5.
One major difference is that in the $\Theta$-genes setup, the cell can only activate sets of genes of size
$\Theta$, while in the combinatorial setup, the cell has the power to activate single genes at will, albeit at
the cost of partially activating genes that are not needed (since a considerable fraction of genes that
should be OFF must have one of the two activators present) and allowing new noncognate configurations.
Fundamentally, therefore, crosstalk reduction comes from the decrease in the number of regulatory
components (TF species) needed in the system, which again points to the explosion in the number
of possible noncognate interactions as the crucial origin of the crosstalk. In other words, what
qualitatively seems to matter is $\Theta$, the number of regulated genes per TF, while the detailed manner
in which these TFs regulate is less important for the actual numerical value of crosstalk (but is
important for the functioning of the cell; e.g., in combinatorial regulation genes can be addressed
individually, while in the model of SI Section 1.5 they cannot be).
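The arithmetic behind the ~7-fold estimate can be spelled out explicitly. The sketch below assumes that the expected reduction scales as the square root of the number of genes controlled per TF (which is what the quoted ~7-fold figure implies for $M = 5000$), and compares it with the reduction actually reported (0.23 to 0.07); variable names are illustrative.

```python
import math

M = 5000                                   # total number of genes
n_tf_perfect = math.sqrt(2 * M)            # TF species needed when f = 1
genes_per_tf = M / n_tf_perfect            # equals sqrt(M / 2)
expected_fold = math.sqrt(genes_per_tf)    # assumed square-root scaling of the crosstalk reduction
observed_fold = 0.23 / 0.07                # basic vs. perfect combinatorial (values from the text)

print(f"TF species: {n_tf_perfect:.0f}, genes per TF: {genes_per_tf:.0f}")
print(f"expected reduction: {expected_fold:.1f}-fold, observed: {observed_fold:.1f}-fold")
```

The gap between the two numbers is the shortfall that the text attributes to partially activated genes and the new noncognate configurations.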
Figure S25: Different regimes in the $(Q, S)$ plane for the basic and combinatorial setup. Shifts in
the regime boundaries in the basic activation setup vs. the combinatorial regulation setup. In the
leftmost panel, we show the regimes for the basic activation setup. In the other panels, we show
the regimes for the combinatorial setup for $f = 0.001, 0.1,$ and $1$, respectively, from left to right.
For $f = 0.001$, the "regulation regime" is slightly smaller than in the basic activation setup. As $f$
increases, the "regulation regime" increases in size (and is bigger than in the basic activation setup)
and the boundary with $C = 0$ is pushed towards larger $S$.


We also note that while near-ideal combinatorial regulation appears to be a useful strategy to
reduce the crosstalk, studies of scaling laws in gene regulatory networks do not appear to be
consistent with the use of such a pure combinatorial strategy. In particular, the number of TFs scales
at least linearly (quadratically, in prokaryotes) with the total number of genes [47] across different
organisms, while an efficient combinatorial strategy would suggest sub-linear (e.g., square-root)
scaling. This clearly does not preclude the use of combinatorial regulation in some regulatory
elements, but does show that even with the possible utilization of the combinatorial strategy the
observed growth in the number of distinct TF species (which seems to be an important crosstalk
parameter) is extensive.


8 Weak global repressor 


So far we only considered gene regulation by activators. Cells, however, also have repression
mechanisms as an additional means of regulation. As a first step to account for that, we incorporate
into the model one type of abundant weak global repressor that interacts with all binding sites
with sequence-independent low affinity. Non-specific repression mechanisms such as the nuclear
envelope, histones and DNA methylation are thought to mitigate spurious transcription [25]. It
was hypothesized that their emergence enabled the genome expansion in the transitions from
prokaryotes to eukaryotes and from invertebrates to vertebrates [25]. We include an additional
molecule in the model, which is found in concentration $C_r$ and can bind all binding sites equally
well with energy $0 < E_r < E_a$, namely it is more favorable than the unbound state, but not as
favorable as the specific cognate activator of each site. Hence, our intuition was that such a global
repressor cannot compete equally with specific binding, but it can reduce non-specific binding. The
crosstalk expressions now read:


$$x_1 = \frac{SC + C_r e^{-E_r} + e^{-E_a}}{SC + C/Q + C_r e^{-E_r} + e^{-E_a}}, \qquad x_2 = \frac{SC}{SC + C_r e^{-E_r} + e^{-E_a}}. \tag{S65}$$
Figure S26: Difference in minimal crosstalk between the combinatorial setup and the basic activation
setup for different $f$. Panel (a) shows $f = 0.001$, where combinatorial regulation underperforms
the basic regulation setup. (b,c,d) Increasing values of $f$ ($f = 0.01, 0.1, 1$, respectively) can
lower the crosstalk relative to the basic setup. At baseline parameters ($Q = 2500$, $M = 5000$ and
$\log(S) = -10.5$), minimal crosstalk for the combinatorial setups reads $X^*_{\mathrm{comb}} = 0.28, 0.18, 0.11$ and
$0.07$ for $f = 0.001, 0.01, 0.1$ and $1$ respectively, compared to $X^* = 0.23$ for the basic activation setup.
Figure S27: Scaling of the typical number of TFs present ($t$) and number of interactions per TF
($n_{\mathrm{int}}$) as a function of $Q$ for different $f$. For each $f$, for $Q$ smaller than some threshold value
which depends on $f$, the number of TFs varies as $t = 2Q$ and the number of interactions per TF
$n_{\mathrm{int}}$ is constant at 1. For all $Q$ greater than this threshold value, $\log n_{\mathrm{int}}$ increases linearly
with $\log Q$ ($n_{\mathrm{int}}$ changes with $Q$ in a power-law fashion).


As before, we minimize the crosstalk with respect to the TF concentration. The optimal
concentration is now:

$$C^*_{\mathrm{GR}} = -\frac{Q\left(C_r e^{-E_r} + e^{-E_a}\right)\left(\sqrt{S(M-Q)} - S\left(SMQ - Q(SQ+2) + M\right)\right)}{S\left(-M(SQ+1)^2 + SQ^2(SQ+3) + Q\right)}. \tag{S67}$$


This is the same optimal concentration $C^*$ as in Eq. (S11b), only scaled by a factor
$C_r e^{-E_r} + e^{-E_a}$ instead of $e^{-E_a}$ there. We conclude that the only effect of a global repressor
is to scale down the effective concentration of the specific activator. This is simply compensated for by
a larger concentration of the activator. Hence, regardless of the global repressor affinity $E_r$ and
concentration $C_r$, this additional regulatory mechanism cannot lower the crosstalk beyond what is
possible with specific activators only. As before, the minimal crosstalk is:

$$X^*_{\mathrm{GR}} = \frac{Q}{M}\left(-S(M-Q) + 2\sqrt{S(M-Q)}\right). \tag{S68}$$
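A quick numerical check makes the conclusion concrete: the global repressor rescales the optimal activator concentration but leaves the minimal crosstalk unchanged. The sketch below uses hypothetical function names and parameter values chosen only for illustration, and writes the optimum in a compact form that is algebraically equivalent to Eq. (S67).

```python
import math

def crosstalk_gr(C, S, Q, M, Ea, Cr, Er):
    """Total crosstalk with a weak global repressor at concentration Cr and energy Er."""
    b = Cr * math.exp(-Er) + math.exp(-Ea)   # repressor-bound plus unbound weight
    x1 = (S * C + b) / (S * C + C / Q + b)   # gene should be ON
    x2 = (S * C) / (S * C + b)               # gene should be OFF
    return (Q / M) * x1 + (1 - Q / M) * x2

def optimum_gr(S, Q, M, Ea, Cr, Er):
    """Closed-form (C*, X*) in the regulation regime S*(M-Q) < 1."""
    r = math.sqrt(S * (M - Q))
    b = Cr * math.exp(-Er) + math.exp(-Ea)
    c_star = b * Q * (1 - r) / (r * (1 + S * Q) - S * Q)
    x_star = (Q / M) * (2 * r - r * r)       # does not depend on Cr or Er
    return c_star, x_star

if __name__ == "__main__":
    S, Q, M, Ea, Er = math.exp(-10.5), 2500, 5000, 15.0, 10.0
    for Cr in (0.0, 1.0, 100.0):
        c_star, x_star = optimum_gr(S, Q, M, Ea, Cr, Er)
        print(Cr, c_star, x_star, crosstalk_gr(c_star, S, Q, M, Ea, Cr, Er))
```

Increasing `Cr` raises the optimal activator concentration while the crosstalk evaluated at that optimum stays the same.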


9 Regulation by a combination of specific activators and specific 
repressors 

As the global repressor examined in the previous section did not show any additional improvement
in crosstalk, we elaborate the model further to account for specific repressors, in analogy to the
specific activators. We extend the basic model (Section 1), in which a gene had a single regulatory
site and was regulated by an activator alone, to a more general model in which each gene has
two regulatory sites: one compatible with a specific activator binding and the other with a specific
repressor. We assume that each gene has a unique activator and a unique repressor. In the basic
model (Section 1), for a gene to be silent its binding site should be vacant. The only way to achieve
this was to lower the activator concentration. On the other hand, to improve activation reliability,
the activator concentration should be increased. Thus, in the simple model there seemed to be
a trade-off between reliable activation and elimination of undesirable activation. The existence of
a specific molecule that blocks the site from binding of other (potentially activating) molecules is
thought to be a more reliable way to prevent undesired gene activation, not at the expense of the
activation of other genes [58].

To be consistent with the basic model, we assume that the total concentration of all TFs (activators
and repressors together) is constant, $C$. As before, $Q$ genes need to be activated, for which $Q$
specific activators are present. The other $M - Q$ genes need to be silent, for which we now add their
$M - Q$ specific repressors. All activators are found in equal concentrations, $C_A/Q = \alpha C/Q$ each.
All repressors are in equal concentrations, $C_R/(M-Q) = (1-\alpha)C/(M-Q)$ each. We allow for
different binding energies for the two binding sites, $E_a$ and $E_r$. We assume that activation can only
occur by binding of an activator molecule to the 'A' site. Repression is asymmetric in the sense that
binding of any molecule to the repressor site prevents activation regardless of what is bound to the
activator site. Thus a gene can only be active if the repressor site is empty and the activator site is
bound by an activator. See the list of all possible states of the two binding sites in Tables 6 and 7
below.

9.1 Overlapping activator and repressor binding sites 

For some genes, the regulatory sites of the activator and repressor partially overlap. Another
possibility is "negative cooperativity", when one molecule repels the other. The outcome of either
option is that either an activator or a repressor could be bound at any given time, but not both of them
simultaneously. In Tables 6 and 7 all the states above the double horizontal line are such that only one
site can be bound at any given time ('overlapping sites'). The additional states below the line are
only possible if both sites can be bound simultaneously ('non-overlapping sites'). Fig. S28 illustrates
the dependence of crosstalk on the energy $E_r$ (energy gap between unbound and repressor-bound
states) for different numbers of co-activated genes $Q$. Crosstalk is minimized for $E_r = E_a$ exactly
when $Q = M - Q$, meaning an equal number of activated and repressed genes. However, for other
values, $Q \neq M - Q$, the optimal $E_r$ is also not significantly different from $E_a$.
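no. | configuration (R-site, A-site) | activity | crosstalk if ON | Energy | Weight
1 | U, U | OFF | + | $E_a + E_r$ | $e^{-(E_a + E_r)}$
2 | U, $C_A$ | ON | - | $E_r$ | $\frac{C\alpha}{Q} e^{-E_r}$
3 | U, $N_A$ | * | + | $E_r + \epsilon d$ | $C\alpha S e^{-E_r}$
4 | U, $N_R$ | OFF | + | $E_r + \epsilon d$ | $C(1-\alpha) S e^{-E_r}$
5 | $C_A$, U | OFF | + | $E_a + \epsilon d$ | $\frac{C}{Q}\alpha S e^{-E_a}$
6 | $N_A$, U | OFF | + | $E_a + \epsilon d$ | $C\frac{Q-1}{Q}\alpha S e^{-E_a}$
7 | $N_R$, U | OFF | + | $E_a + \epsilon d$ | $C(1-\alpha) S e^{-E_a}$
====================================================================================
8 | ($N_A$, $C_A$), $C_A$ | OFF | + | $\epsilon d$ | $\frac{(C\alpha)^2 S}{Q}$
9 | $C_A$, $N_A$ | OFF | + | $\epsilon(d_1 + d_2)$ | $\frac{(C\alpha)^2 S^2}{Q}\frac{Q-1}{Q}$
10 | $N_R$, $C_A$ | OFF | + | $\epsilon d$ | $\frac{C^2}{Q} S \alpha(1-\alpha)$
11 | ($N_A$, $N_R$), $N_A$ | OFF | + | $\epsilon(d_1 + d_2)$ | $C^2 S^2 \alpha \frac{Q-1}{Q}\frac{Q-\alpha}{Q}$
12 | ($N_R$, $N_A$, $C_A$), $N_R$ | OFF | + | $\epsilon(d_1 + d_2)$ | $C^2 S^2 (1-\alpha)$
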

Table 6: All possible binding configurations, corresponding energies and statistical weights for
a two-binding site (A,R)-model: a gene that needs to be activated (hence its cognate activator is
present and its cognate repressor is absent). The subscripts 'A' and 'R' refer to activator and
repressor. We assume that the site to which the molecule binds determines the activity state, where
binding to the A-site can activate the gene and binding to the R-site (even if it is an activator!)
hinders activation. 'C' denotes binding by a cognate factor, 'N' binding by a noncognate factor, and 'U'
means that the site is unbound. $E_a$ and $E_r$ are the energy gaps between unbound and cognate-bound
states of the corresponding binding sites. In the upper part of the table (above the double line) we
enumerate only states possible when both sites cannot be bound simultaneously (simplified model). If
the two sites can be bound simultaneously, there are additional binding configurations, which are
detailed below the line. The column 'crosstalk if ON' lists all binding configurations that were
accounted for as crosstalk in the $x_1$ calculation: in this case all except for no. 2 (U, $C_A$).
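no. | configuration (R-site, A-site) | activity | crosstalk if OFF | Energy | Weight
1 | U, U | OFF | - | $E_a + E_r$ | $e^{-(E_a + E_r)}$
2 | $C_R$, U | OFF | - | $E_a$ | $\frac{C(1-\alpha)}{M-Q} e^{-E_a}$
3 | $N_A$, U | OFF | - | $E_a + \epsilon d$ | $C S \alpha e^{-E_a}$
4 | $N_R$, U | OFF | - | $E_a + \epsilon d$ | $C S (1-\alpha) e^{-E_a}$
5 | U, $N_A$ | * | + | $E_r + \epsilon d$ | $C S \alpha e^{-E_r}$
6 | U, ($C_R$, $N_R$) | OFF | - | $E_r + \epsilon d$ | $C S (1-\alpha) e^{-E_r}$
====================================================================================
7 | $C_R$, ($C_R$, $N_R$, $N_A$) | OFF | - | $\epsilon d$ | $\frac{C(1-\alpha)}{M-Q} C S$
8 | $N_R$, ($C_R$, $N_R$, $N_A$) | OFF | - | $\epsilon(d_1 + d_2)$ | $C^2 S^2 (1-\alpha)$
9 | $N_A$, ($C_R$, $N_R$) | OFF | - | $\epsilon(d_1 + d_2)$ | $C^2 S^2 \alpha(1-\alpha)$
10 | $N_A$, $N_A$ | OFF | - | $\epsilon(d_1 + d_2)$ | $C^2 S^2 \alpha^2$
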

Table 7: All possible binding configurations, corresponding energies and statistical weights for a
two-binding site (A,R)-model: a gene that needs to be silent (hence its cognate repressor is present
and its cognate activator is absent). All notation is the same as in Table 6. The column 'crosstalk if
OFF' lists binding configurations that were accounted for as crosstalk in the $x_2$ calculation: in this
case only no. 5.
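For the overlapping-sites variant (only the states above the double line), the crosstalk can be evaluated directly from the state weights. The sketch below is an illustration, not the authors' code: function names and the grid search are hypothetical, and the weights are this document's reading of the single-site-bound states of Tables 6 and 7, which should be treated as an assumption. A gene that should be ON counts every state except the cognate activator on the A-site as crosstalk; a gene that should be OFF counts only a noncognate activator on the A-site.

```python
import math

def x_on(C, alpha, S, Q, M, Ea, Er):
    """Per-gene crosstalk for a gene that should be ON (overlapping sites)."""
    a, r = math.exp(-Ea), math.exp(-Er)
    ca, cr = C * alpha, C * (1 - alpha)      # total activator / repressor concentration
    correct = (ca / Q) * r                   # cognate activator on A-site, R-site empty
    others = [
        a * r,                               # both sites empty
        ca * S * r,                          # noncognate activator on A-site
        cr * S * r,                          # repressor on A-site
        (ca / Q) * S * a,                    # cognate activator on R-site
        ca * S * (Q - 1) / Q * a,            # noncognate activator on R-site
        cr * S * a,                          # noncognate repressor on R-site
    ]
    return sum(others) / (correct + sum(others))

def x_off(C, alpha, S, Q, M, Ea, Er):
    """Per-gene crosstalk for a gene that should be OFF (overlapping sites)."""
    a, r = math.exp(-Ea), math.exp(-Er)
    ca, cr = C * alpha, C * (1 - alpha)
    wrong = ca * S * r                       # noncognate activator on A-site
    others = [
        a * r,                               # both sites empty
        cr / (M - Q) * a,                    # cognate repressor on R-site
        ca * S * a,                          # activator on R-site
        cr * S * a,                          # noncognate repressor on R-site
        cr * S * r,                          # repressor on A-site
    ]
    return wrong / (wrong + sum(others))

def total_crosstalk(C, alpha, S, Q, M, Ea, Er):
    return (Q / M) * x_on(C, alpha, S, Q, M, Ea, Er) + (1 - Q / M) * x_off(C, alpha, S, Q, M, Ea, Er)

def grid_minimum(S, Q, M, Ea, Er):
    """Coarse grid search over total concentration C and activator fraction alpha."""
    best = (float("inf"), None, None)
    for i in range(81):
        C = 10 ** (-6 + 0.1 * i)
        for j in range(1, 20):
            alpha = 0.05 * j
            x = total_crosstalk(C, alpha, S, Q, M, Ea, Er)
            if x < best[0]:
                best = (x, C, alpha)
    return best

if __name__ == "__main__":
    print(grid_minimum(math.exp(-10.5), 2500, 5000, 15.0, 15.0))
```

Scanning `Er` at fixed `Ea` with this function is one way to explore how the optimal repressor affinity depends on $Q$; the grid is coarse, so the result is only indicative.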
Figure S28: Activator-repressor overlapping binding sites, different $Q$ values. $E_r^*$, the energy
gap between unbound and repressor-bound states that minimizes crosstalk, depends on the number
of co-activated genes $Q$. Here we show numerical results for the minimal crosstalk $X^*$ as a
function of the repressor binding affinity $E_r$ (with constant activator affinity $E_a = 15$) for different
numbers of co-activated genes $Q$, in the model where activator and repressor binding sites overlap.
We find that when the number of co-activated genes decreases (so that more genes need to be
repressed) the optimal repressor affinity $E_r^*$ increases, so that repressors more effectively bind their
cognate binding sites and eliminate spurious transcription. When the number of genes that need
to be activated equals the number of genes that need to be repressed, $Q = M - Q$, we obtain that
full symmetry between activator and repressor, $E_r^* = E_a$, provides minimal crosstalk; this case is
shown in the main text, Fig. 5. Parameters: M = 5000, S = 10“"'^'^.